Michael Mendoza
Neil Samuel Pulukuri
Gnana Chaithanya Rawali Male
Abinav Yadamani
Launching innovative products is crucial for beverage companies to drive revenue growth, diversify the product portfolio, attract broader consumer categories, and strengthen brand visibility and loyalty. These products also cater to recent trends that prioritize health and sustainability, and they boost profit because they can be offered at premium pricing. Overall, product innovation is critical for staying relevant in a dynamic market, ultimately boosting revenue and sustaining long-term growth. Swire Coca-Cola is constantly introducing innovative products into the market.
Swire Coca-Cola faces a business challenge in optimizing production planning and inventory management for its innovative beverage products segment. These products have no exact match in the historic sales data, and Swire has in the past faced both extremes: leaving money on the table by underestimating demand, and overproducing, which leads to product distress. The company aims to accurately forecast weekly demand and identify the most profitable regions and time periods for launching these products, using historical sales and market data of similar products from Swire and its competitors, along with customer demographic information, to strike a balance between out-of-stock and overproduction situations.
The goals of this project are as follows:
The purpose of this notebook is to develop multiple models to accurately forecast demand on a weekly basis for the given 7 innovative products. Swire will use these models to plan production and inventory for these products, with the outcome that Swire takes maximum advantage of the revenue these innovative products can generate while minimizing overproduction.
The products we are dealing with are innovative products, meaning they have no exact sales history in the past. Without historical data, building a model poses a major challenge and a blocker for this project.
To overcome this, our approach is to match and map the product attributes (Flavor/Category/Caloric Segment/Package/Manufacturer/Brand), use them to filter the data, and then develop forecasting models. It would not be possible to apply all six filters at once, so we will use 2-3 sets of filters for modelling in order to capture the nuances in the data. Once we have a forecast with the initial set of filters, we will account for the remaining filters.
Why Flavor?
We will filter the dataset by the flavor of the innovative product first because flavor is a primary determinant of consumer appeal and purchasing behavior in the beverage industry. Analyzing the performance of a specific flavor variant allows a more targeted and accurate assessment of its market potential and demand. This targeted approach enables us to assess the demand and performance of the flavor independently of other variants or products within the same brand or category. Additionally, flavor is a critical product attribute that can help capture seasonality in the dataset. This granularity improves the precision of our forecasting models.
After flavor, the next important filters are brand/manufacturer, caloric segment, and category.
Why filtering by Category is important?
Analyzing the category, whether "Regular" or "Diet/Light," is crucial in understanding consumer purchase behavior for the following reasons:
- Health Consciousness & Lifestyle Preferences: Consumers opting for "Diet/Light" products are likely motivated by health considerations, such as calorie control or sugar reduction, which influence their purchase decisions. This is especially critical for diabetic consumers or in regions where we identified a higher proportion of seniors.
- Impact on Demand: Consumer perception of a product's category can significantly impact its demand dynamics. Changes in health trends or dietary recommendations may influence the popularity of "Diet/Light" products, while factors like taste innovations or marketing campaigns may drive demand for "Regular" offerings.
Later in the modelling, we observed a strong correlation between diet product sales and the time of year. Diet sales are highest toward the beginning of the year and then gradually decline. This might be because people prefer healthier, low-sugar options at the start of the year as part of New Year's resolutions.
Why filtering by Caloric Segment is important?
Similar to Category, consumer preferences driven by taste and health are predominant in determining whether a consumer will purchase a specific water type or an energy drink.
Why filtering by manufacturer or Brand is important?
Brand Loyalty: Consumers often exhibit brand loyalty, preferring products from specific brands due to factors such as trust, reputation, and perceived quality. Analyzing data specific to a brand therefore yields a more reliable forecast.
1. Region Mapping
In the scope of the dataset, we have assumed the following region-to-state mapping:
North West States: Washington (WA), Oregon (OR)
North States: Idaho (ID), Wyoming (WY)
South States: Arizona (AZ), New Mexico (NM)
West States: Nevada (NV), California (CA)
Central States: Utah (UT), Colorado (CO)
East States: Nebraska (NE), South Dakota (SD), Kansas (KS)
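The mapping above can be applied as a simple lookup. A minimal sketch, assuming a state-abbreviation column named 'STATE' exists (that column name is illustrative, not confirmed from the dataset):

```python
import pandas as pd

# State-abbreviation -> region mapping, mirroring the list above
REGION_MAP = {
    'WA': 'North West', 'OR': 'North West',
    'ID': 'North', 'WY': 'North',
    'AZ': 'South', 'NM': 'South',
    'NV': 'West', 'CA': 'West',
    'UT': 'Central', 'CO': 'Central',
    'NE': 'East', 'SD': 'East', 'KS': 'East',
}

# Toy frame standing in for the market data; 'STATE' is a hypothetical column name
df = pd.DataFrame({'STATE': ['WA', 'CA', 'KS']})
df['REGION'] = df['STATE'].map(REGION_MAP)  # -> North West, West, East
```

Any state outside the 13 mapped abbreviations would map to NaN, which makes unmapped markets easy to spot.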
2. 13 weeks of the year - consecutive or best 13 of the year?
The questions provided by Swire do not specify whether the 13 weeks need to be consecutive or can be the top 13 weeks of the year in any order. For this project, we will consider the best set of 13 consecutive weeks. The reason is simple: in the initial meetings with the Swire stakeholders, we were told that retail stores are restocked based on how well the product performs in the weeks following launch. Moreover, it would not be feasible for Swire to launch a new product for just one week and then restock several months later. Keeping these factors in mind, within the scope of this project, whenever a question asks for the best 13 weeks of the year, we will interpret it as the best 13 consecutive weeks.
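The "best consecutive 13 weeks" interpretation boils down to a rolling-window maximum. A minimal sketch with toy data (not the Swire dataset):

```python
import pandas as pd

# Hypothetical weekly unit sales, indexed by week number 1-52
weekly = pd.Series(range(1, 53), index=range(1, 53))

# Rolling 13-week totals; each value is the sum of the window ending at that week
rolling_13 = weekly.rolling(window=13).sum()

best_end_week = rolling_13.idxmax()               # last week of the best window
best_window = (best_end_week - 12, best_end_week) # (start week, end week)
best_total = rolling_13.max()                     # here weeks 40-52, total 598
```

Picking the top 13 weeks in any order would instead be `weekly.nlargest(13)`, which is exactly the interpretation we are ruling out.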
3. Emphasis on Unit Sales over Dollar Sales?
The business problem revolves around forecasting sales, not predicting revenue. Therefore, we will prioritize unit sales, as they provide better insight into how many units Swire needs ready before launching a product and for restocking retail stores. This also eliminates the multicollinearity identified in the EDA phase.
4. Potential outliers?
In the notebook, you will find outliers detected for a few products. However, since the stakeholders informed us that the data had already gone through thorough cleaning before it was handed over, we will not remove any outliers.
Data Preparation is the process of transforming raw data into a format suitable for analysis and modeling. This includes steps such as variable transformations, feature engineering, and handling NAs. In the context of this project, data preparation will vary by question. Some examples are as follows:
1. Questions involving regions need an added column marking the region, based on the states we mapped in the EDA.
2. For the Prophet model, the inputs must be formatted as 'ds' and 'y' columns only, where ds is the date column and y holds the actual values of the target variable. In our case we will focus on unit sales only and drop dollar sales, since the EDA showed very strong multicollinearity between these two variables.
3. In the EDA phase, missing values were identified for a few particular Brands. After thorough analysis of these brands, we have already imputed them.
In addition to this pre-model work, question- and model-specific data preparation will be listed wherever necessary throughout the notebook.
Prophet: an additive time-series model that decomposes a series into trend, seasonality, and holiday effects, and is robust to missing data and trend changepoints.
ARIMA (AutoRegressive Integrated Moving Average): models a series from its own lags and lagged forecast errors, after differencing to achieve stationarity.
SARIMA (Seasonal AutoRegressive Integrated Moving Average): extends ARIMA with seasonal autoregressive, differencing, and moving-average terms to capture repeating patterns.
Exponential Smoothing: forecasts with exponentially weighted averages of past observations, optionally with trend and seasonal components (Holt-Winters).
In summary, Prophet, ARIMA, SARIMA, and Exponential Smoothing were chosen and will be used depending on the question and the dataset we need to deal with. For most questions we will have multiple models forecast on the same data wherever feasible. The best model will be selected based on its performance on the test set, primarily by having the lowest error metrics.
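The selection rule can be sketched in a few lines: score each candidate model's holdout forecast with MAPE and keep the model with the lowest error. The numbers here are toy values, not real forecasts from this notebook:

```python
import numpy as np

# Toy holdout actuals and two candidate models' forecasts (illustrative values)
actual = np.array([100.0, 120.0, 90.0, 110.0])
candidates = {
    'prophet': np.array([105.0, 115.0, 95.0, 108.0]),
    'sarima':  np.array([90.0, 140.0, 70.0, 130.0]),
}

# Mean absolute percentage error for each candidate
mape = {name: np.mean(np.abs((actual - pred) / actual)) * 100
        for name, pred in candidates.items()}

# Keep the model with the lowest MAPE
best_model = min(mape, key=mape.get)
```

RMSE or MAE could be swapped in as the ranking metric; MAPE is used here because it is the headline metric reported later in the notebook.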
# importing libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
# Setting the warnings to be ignored
warnings.filterwarnings('ignore')
#market_data = pd.read_csv(r"/content/drive/MyDrive/FACT_MARKET_DEMAND_Ref.csv")
market_demand = pd.read_csv("FACT_MARKET_DEMAND-001.csv")
Data Preparation: Let's impute the NAs in Caloric Segment with the Regular type, as that is the mode. The explanation has already been covered in the EDA, where we found that all the missing values came from only three Brands, and every product from those brands was null for the caloric segment. After digging deeper into the other aspects of those brands, we concluded that Regular, which is also the majority value in the dataset, is the right imputation strategy.
market_demand['CALORIC_SEGMENT'].fillna(market_demand['CALORIC_SEGMENT'].mode()[0], inplace=True)
print("\nMissing value imputation check:")
print(market_demand.isnull().sum().sum())
Missing value imputation check: 0
Item Description: Diet Smash Plum 11Small 4One
Which 13 weeks of the year would this product perform best in the market? What is the forecasted demand, in weeks, for those 13 weeks?
Beginning filtering by Flavor:
# filtering for Plum (copy so later column assignments don't trigger SettingWithCopyWarning)
plum = market_demand[market_demand['ITEM'].str.contains('PLUM', case=False, regex=True)].copy()
Adding a week number column:
# Ensure the 'DATE' column is in datetime format
plum['DATE'] = pd.to_datetime(plum['DATE'])
# Find the earliest date
earliest_date = plum['DATE'].min()
# Calculate the week numbers from the earliest date
plum['WEEK_NUMBER'] = (plum['DATE'] - earliest_date).dt.days // 7 + 1
# Aggregate data by week number
weekly_sales = plum.groupby('WEEK_NUMBER')['UNIT_SALES'].sum()
# Plot time series
plt.figure(figsize=(18, 6))
weekly_sales.plot(marker='o', color='blue', linestyle='-')
plt.title('Weekly Unit Sales')
plt.xlabel('Week Number')
plt.ylabel('Total Unit Sales')
plt.grid(True)
plt.show()
Plum appears to have good sales during the beginning and end weeks of the year. Let's also filter by Swire now and look at a holistic chart of sales across all years for Plum + Swire:
import pandas as pd
import matplotlib.pyplot as plt
from scipy.signal import argrelextrema
# Assuming 'plum' dataframe contains 'WEEK_NUMBER' and 'UNIT_SALES' columns
# Aggregate data by week number
weekly_sales = plum[plum['MANUFACTURER'] == 'SWIRE-CC'].groupby('WEEK_NUMBER')['UNIT_SALES'].sum()
# Find local maxima (peaks)
peaks_index = argrelextrema(weekly_sales.values, comparator=lambda x, y: x > y, order=5)[0]
peaks = weekly_sales.iloc[peaks_index]
# Plot time series with peaks
plt.figure(figsize=(17, 8))
weekly_sales.plot(marker='o', color='blue', linestyle='-', label='Total Unit Sales')
peaks.plot(marker='o', color='red', linestyle='', markersize=8, label='Peaks')
# Annotate peaks with week number
for week, sales in peaks.items():
    plt.annotate(f'Week {week}\n({sales} units)', xy=(week, sales), xytext=(-20, 20), textcoords='offset points',
                 arrowprops=dict(arrowstyle='->', color='black'))
plt.title('Weekly Unit Sales with Peaks')
plt.xlabel('Week Number')
plt.ylabel('Total Unit Sales')
plt.grid(True)
plt.legend()
plt.show()
The peaks appear at roughly 13-week intervals; Swire might be restocking the shelves on that cadence.
Let's group by week of the year to gauge sales against the time of year, and dive deeper by aggregating the sales data over every 13 weeks:
plum['13_WEEK_INTERVAL'] = (plum['WEEK_NUMBER'] - 1) // 13 + 1
import matplotlib.pyplot as plt
# Assuming rolling_sales_df contains the DataFrame with 'SALES_13W_SUM' and '13_WEEK_INTERVAL' columns
# Extract data for plotting
sales_sum = rolling_sales_df['SALES_13W_SUM']
interval = rolling_sales_df['13_WEEK_INTERVAL']
# Plot the rolling sum
plt.figure(figsize=(18, 6))
plt.plot(interval, sales_sum, marker='o', linestyle='-')
plt.xlabel('13-Week Interval')
plt.ylabel('Sales Sum')
plt.title('Rolling Sum of Sales over 13-Week Intervals')
plt.xticks(rotation=45) # Rotate x-axis labels for better readability
plt.grid(True)
# Find the index of the maximum value
max_index = sales_sum.idxmax()
max_value = sales_sum[max_index]
max_interval = interval[max_index]
# Annotate the peak with a red marker and the corresponding 13-week interval
plt.annotate(f'Max: {max_value} ({max_interval})',
xy=(max_interval, max_value),
xytext=(80, -20),
textcoords='offset points',
arrowprops=dict(facecolor='red', arrowstyle='->'),
fontsize=10,
color='red',
ha='center')
plt.tight_layout()
plt.show()
From the plot above using Plum + Swire data, we see that weeks 22-34 of the year are the highest in the historic data.
swire_plum = plum[plum['MANUFACTURER'] == 'SWIRE-CC']
swire_plum= swire_plum.sort_values('DATE')
swire_plum_daily = swire_plum.groupby('DATE')['UNIT_SALES'].sum().reset_index()
swire_plum_daily.head(2)
| | DATE | UNIT_SALES |
|---|---|---|
| 0 | 2020-12-05 | 1961.0 |
| 1 | 2020-12-12 | 1625.0 |
from prophet import Prophet
from prophet.plot import plot_plotly, plot_components_plotly
import pandas as pd
# Convert DATE column to datetime if not already in datetime format
swire_plum_daily['DATE'] = pd.to_datetime(swire_plum_daily['DATE'])
# Sort the data by date
swire_plum_daily = swire_plum_daily.sort_values(by='DATE')
# Calculate the index for partitioning the data
partition_index = int(len(swire_plum_daily) * 0.8)
# Split the data into train and test sets
train_data = swire_plum_daily.iloc[:partition_index]
test_data = swire_plum_daily.iloc[partition_index:]
# Rename columns as per Prophet's requirements
train_data = train_data.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
test_data = test_data.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
# Print the lengths of train and test dataframes
print("Length of Train Data:", len(train_data))
print("Length of Test Data:", len(test_data))
# Print the start and end dates of train and test dataframes
print("Start Date of Train Data:", train_data['ds'].min())
print("End Date of Train Data:", train_data['ds'].max())
print("Start Date of Test Data:", test_data['ds'].min())
print("End Date of Test Data:", test_data['ds'].max())
Length of Train Data: 119
Length of Test Data: 30
Start Date of Train Data: 2020-12-05 00:00:00
End Date of Train Data: 2023-03-11 00:00:00
Start Date of Test Data: 2023-03-18 00:00:00
End Date of Test Data: 2023-10-28 00:00:00
# Initialize and fit the Prophet model
p_model = Prophet()
p_model.fit(train_data)
# Create future dates for forecasting
future = p_model.make_future_dataframe(periods=34, freq='W')
# Make predictions
forecast = p_model.predict(future)
# Plot the forecast
plot_plotly(p_model, forecast)
plot_components_plotly(p_model, forecast)
We observe that while Swire's plum sales dipped in 2022, they bounced back in 2023, turning the trend upward. The yearly chart above shows peaks during March and August.
Let's plot a chart that shows how the model performed on the train and test sets:
pred = forecast[['ds','yhat']]
import matplotlib.pyplot as plt
# Plot test data
plt.figure(figsize=(12, 6))
plt.plot(train_data['ds'], train_data['y'], label='Actual Train Set', color='blue')
plt.plot(test_data['ds'], test_data['y'], label = 'Actual Test Set',color='lightblue')
# Plot forecast data
plt.plot(forecast['ds'], forecast['yhat'], label='Forecast', color='red')
# Add labels and title
plt.xlabel('Date')
plt.ylabel('Sales')
plt.title('Actual vs Forecast')
plt.legend()
# Rotate x-axis labels for better readability
plt.xticks(rotation=45)
# Show plot
plt.show()
The model is unable to capture the sharp spikes in the test set, but it does a fair job of capturing the trend and the average weekly unit sales. This is in fact the expectation: if the model were able to capture and respond to every spike, that would be a case of overfitting, and outliers could do more harm than good for Swire.
# Function to calculate the evaluation scores
from sklearn.metrics import mean_absolute_error, mean_squared_error
from math import sqrt
import numpy as np
import pandas as pd

def get_scores(actual, predicted):
    # Calculate error metrics
    mae = mean_absolute_error(actual, predicted)
    mse = mean_squared_error(actual, predicted)
    rmse = sqrt(mse)
    # Align actual and predicted values in one DataFrame
    mape_df = pd.DataFrame({'actual': actual.values, 'predicted': predicted.values})
    # Absolute percentage error per observation
    mape_df['absolute_percentage_error'] = np.abs((mape_df['actual'] - mape_df['predicted']) / mape_df['actual']) * 100
    # Mean absolute percentage error (MAPE) and a direct "accuracy" percentage
    mape = mape_df['absolute_percentage_error'].mean()
    accuracy_percentage = 100 - mape
    return pd.Series(data={'mae': mae, 'mse': mse, 'rmse': rmse, 'mape': mape, 'direct_accuracy': accuracy_percentage},
                     index=['mae', 'mse', 'rmse', 'mape', 'direct_accuracy'])
from sklearn.metrics import mean_absolute_error, mean_squared_error
from math import sqrt
import numpy as np
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt
# Create a validation copy of the Swire plum daily data
swire_plum_sparkling_validation = swire_plum_daily.copy()
# Convert Date to datetime variable
swire_plum_sparkling_validation['DATE'] = pd.to_datetime(swire_plum_sparkling_validation['DATE'])
# Group by date to ensure only one record for each date
swire_plum_sparkling_validation = swire_plum_sparkling_validation.groupby('DATE')['UNIT_SALES'].sum().reset_index()
# Sort the data by date
swire_plum_sparkling_validation = swire_plum_sparkling_validation.sort_values(by='DATE')
# Calculate the index for partitioning the data
partition_index = int(len(swire_plum_sparkling_validation) * 0.8)
# Split the data into train and test sets
train_data = swire_plum_sparkling_validation.iloc[:partition_index]
test_data = swire_plum_sparkling_validation.iloc[partition_index:]
# Renaming for Prophet requirements
train_data = train_data.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
test_data = test_data.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
p_model = Prophet()
p_model.fit(train_data)
# Create future dates for forecasting
future = p_model.make_future_dataframe(periods=0, freq='W')
# Make predictions
forecast = p_model.predict(future)
# Extracting the date and unit sales from forecast
pred = forecast[['ds','yhat']]
22:00:01 - cmdstanpy - INFO - Chain [1] start processing
22:00:01 - cmdstanpy - INFO - Chain [1] done processing
get_scores(test_data['y'], pred['yhat'])
mae                162.088814
mse              43781.580074
rmse               209.240484
mape                13.020970
direct_accuracy     86.979030
dtype: float64
The model is performing well, with a MAPE of about 13% and a direct accuracy of about 87%.
Let's compute the 13-week aggregate sales:
earliest_date = forecast['ds'].min()
# Step 2: Calculate week-of-year numbers (0-51) relative to the earliest forecast date
forecast['WEEK_NUMBER'] = ((forecast['ds'] - earliest_date).dt.days // 7) % 52
weekly_forecast_data = forecast.groupby('WEEK_NUMBER')['yhat'].sum().reset_index()
# Sort the DataFrame by 'WEEK_NUMBER' if it's not already sorted
weekly_forecast_data = weekly_forecast_data.sort_values(by='WEEK_NUMBER')
# Ensure that there are enough rows to compute the rolling sum
if len(weekly_forecast_data) < 13:
    print("Insufficient data to compute rolling sum.")
else:
    # Initialize an empty list to store the results
    rolling_sales_sum = []
    # Iterate over the DataFrame to calculate the rolling sum for each 13-week interval
    for i in range(len(weekly_forecast_data) - 12):
        # Sum of forecasted sales for the current 13-week interval
        sales_sum = weekly_forecast_data.loc[i:i+12, 'yhat'].sum()
        # Construct the 13-week interval string
        interval = f"{weekly_forecast_data.iloc[i]['WEEK_NUMBER']}-{weekly_forecast_data.iloc[i+12]['WEEK_NUMBER']}"
        # Append the result to the list
        rolling_sales_sum.append((sales_sum, interval))
    # Create a new DataFrame from the results
    rolling_sales_df = pd.DataFrame(rolling_sales_sum, columns=['SALES_13W_SUM', '13_WEEK_INTERVAL'])
    # Display the DataFrame with the rolling sum and 13-week intervals
    # print(rolling_sales_df)
import matplotlib.pyplot as plt
# Assuming rolling_sales_df contains the DataFrame with 'SALES_13W_SUM' and '13_WEEK_INTERVAL' columns
# Extract data for plotting
sales_sum = rolling_sales_df['SALES_13W_SUM']
interval = rolling_sales_df['13_WEEK_INTERVAL']
# Plot the rolling sum
plt.figure(figsize=(18, 6))
plt.plot(interval, sales_sum, marker='o', linestyle='-')
plt.xlabel('13-Week Interval')
plt.ylabel('Sales Sum')
plt.title('Rolling Sum of Sales over 13-Week Intervals')
plt.xticks(rotation=45) # Rotate x-axis labels for better readability
plt.grid(True)
# Find the index of the maximum value
max_index = sales_sum.idxmax()
max_value = round(sales_sum[max_index])
max_interval = interval[max_index]
# Annotate the peak with a red marker and the corresponding 13-week interval
plt.annotate(f'Max: {max_value} ({max_interval})',
xy=(max_interval, max_value),
xytext=(70, 20),
textcoords='offset points',
arrowprops=dict(facecolor='red', arrowstyle='->'),
fontsize=10,
color='red',
ha='center')
plt.tight_layout()
plt.show()
While the historic data revealed weeks 22-34 fetching the highest sales, the forecast shows that in 2024 the weeks 35-47 period would fetch the maximum sales of the plum flavor for Swire. We also see that the maximum 13-week aggregates for the historic and forecast data are nearly the same (~73k).
plum_swire_diet = swire_plum[swire_plum['CALORIC_SEGMENT'] == 'DIET/LIGHT']
plum_swire_diet_weekly_sales = plum_swire_diet.groupby('DATE')['UNIT_SALES'].sum().reset_index()
plum_swire_diet_weekly_sales.head(2)
from prophet import Prophet
from prophet.plot import plot_plotly, plot_components_plotly
import pandas as pd
# Convert DATE column to datetime if not already in datetime format
plum_swire_diet_weekly_sales['DATE'] = pd.to_datetime(plum_swire_diet_weekly_sales['DATE'])
# Sort the data by date
plum_swire_diet_weekly_sales = plum_swire_diet_weekly_sales.sort_values(by='DATE')
plum_swire_diet_model_data = plum_swire_diet_weekly_sales.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
# Initialize and fit the Prophet model
p_model = Prophet()
p_model.fit(plum_swire_diet_model_data)
# Create future dates for forecasting
future = p_model.make_future_dataframe(periods=52, freq='W')
# Make predictions
forecast = p_model.predict(future)
# Plot the forecast
plot_plotly(p_model, forecast)
This is interesting: diet sales dipped in 2021 and from then on maintained a sideways to slightly negative trend.
earliest_date = forecast['ds'].min()
# Step 2: Calculate week-of-year numbers (0-51) relative to the earliest forecast date
forecast['WEEK_NUMBER'] = ((forecast['ds'] - earliest_date).dt.days // 7) % 52
weekly_forecast_data = forecast.groupby('WEEK_NUMBER')['yhat'].sum().reset_index()
# Step 4: Create a time series plot
plt.figure(figsize=(12, 4))
plt.plot(weekly_forecast_data['WEEK_NUMBER'], weekly_forecast_data['yhat'], marker='o', linestyle='-')
plt.xlabel('Week Number')
plt.ylabel('Unit Sales')
plt.title('Weekly Total Unit Sales')
plt.grid(True)
plt.show()
Sales at the beginning of the year appear to be the highest. Let's aggregate on a 13-week basis to identify the best consecutive 13 weeks:
# Sort the DataFrame by 'WEEK_NUMBER' if it's not already sorted
weekly_forecast_data = weekly_forecast_data.sort_values(by='WEEK_NUMBER')
# Ensure that there are enough rows to compute the rolling sum
if len(weekly_forecast_data) < 13:
    print("Insufficient data to compute rolling sum.")
else:
    # Initialize an empty list to store the results
    rolling_sales_sum = []
    # Iterate over the DataFrame to calculate the rolling sum for each 13-week interval
    for i in range(len(weekly_forecast_data) - 12):
        # Sum of forecasted sales for the current 13-week interval
        sales_sum = weekly_forecast_data.loc[i:i+12, 'yhat'].sum()
        # Construct the 13-week interval string
        interval = f"{weekly_forecast_data.iloc[i]['WEEK_NUMBER']}-{weekly_forecast_data.iloc[i+12]['WEEK_NUMBER']}"
        # Append the result to the list
        rolling_sales_sum.append((sales_sum, interval))
    # Create a new DataFrame from the results
    rolling_sales_df = pd.DataFrame(rolling_sales_sum, columns=['SALES_13W_SUM', '13_WEEK_INTERVAL'])
    # Display the DataFrame with the rolling sum and 13-week intervals
    # print(rolling_sales_df)
import matplotlib.pyplot as plt
# Assuming rolling_sales_df contains the DataFrame with 'SALES_13W_SUM' and '13_WEEK_INTERVAL' columns
# Extract data for plotting
sales_sum = rolling_sales_df['SALES_13W_SUM']
interval = rolling_sales_df['13_WEEK_INTERVAL']
# Plot the rolling sum
plt.figure(figsize=(18, 6))
plt.plot(interval, sales_sum, marker='o', linestyle='-')
plt.xlabel('13-Week Interval')
plt.ylabel('Sales Sum')
plt.title('Rolling Sum of Sales over 13-Week Intervals')
plt.xticks(rotation=45) # Rotate x-axis labels for better readability
plt.grid(True)
# Find the index of the maximum value
max_index = sales_sum.idxmax()
max_value = round(sales_sum[max_index])
max_interval = interval[max_index]
# Annotate the peak with a red marker and the corresponding 13-week interval
plt.annotate(f'Max: {max_value} ({max_interval})',
xy=(max_interval, max_value),
xytext=(70, 20),
textcoords='offset points',
arrowprops=dict(facecolor='red', arrowstyle='->'),
fontsize=10,
color='red',
ha='center')
plt.tight_layout()
plt.show()
It is clear from the plot above that weeks 0-12 are the clear winner after filtering for the diet caloric segment on top of plum and Swire. This might indicate that people make New Year's resolutions to cut down on sugar and, as the year progresses, drift back to their habits.
Now let's filter the forecasted data to the last year and extract the first 13 weeks of the year.
# week wise sales for best 13 weeks
best_13_weeks_q1 = forecast[(forecast['WEEK_NUMBER'] <= 12) & (forecast['ds'] > '2023-02-25')]
best_13_weeks_q1.head(2)
| | ds | trend | yhat_lower | yhat_upper | trend_lower | trend_upper | additive_terms | additive_terms_lower | additive_terms_upper | yearly | yearly_lower | yearly_upper | multiplicative_terms | multiplicative_terms_lower | multiplicative_terms_upper | yhat | WEEK_NUMBER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 145 | 2023-12-03 | 752.538260 | 476.851987 | 886.220095 | 750.513944 | 754.733644 | -61.95856 | -61.95856 | -61.95856 | -61.95856 | -61.95856 | -61.95856 | 0.0 | 0.0 | 0.0 | 690.579701 | 0 |
| 146 | 2023-12-10 | 749.318607 | 435.234791 | 850.450446 | 746.771570 | 752.273543 | -89.59627 | -89.59627 | -89.59627 | -89.59627 | -89.59627 | -89.59627 | 0.0 | 0.0 | 0.0 | 659.722337 | 1 |
# Compute total sales for all 13 weeks
total_sales = best_13_weeks_q1['yhat'].sum()
# Compute lower and upper bounds for estimate
lower_bound = best_13_weeks_q1['yhat_lower'].sum()
upper_bound = best_13_weeks_q1['yhat_upper'].sum()
# Calculate confidence interval
confidence_interval = ((upper_bound - lower_bound) / total_sales) * 100
# Display the results
print("Total Sales for 13 weeks:", total_sales)
print("Lower Bound Estimate:", lower_bound)
print("Upper Bound Estimate:", upper_bound)
print("Confidence Interval (%):", confidence_interval)
Total Sales for 13 weeks: 9057.187762419222
Lower Bound Estimate: 6291.951059331268
Upper Bound Estimate: 11799.518942786954
Confidence Interval (%): 60.80880763351411
Therefore, the best 13 weeks are weeks 0-12, with a total sales forecast of about 9,057 units. The lower and upper bounds above represent the limits of the estimated sales range derived from the Prophet model (by default, Prophet reports an 80% uncertainty interval).
The 60.81% figure is the width of that interval expressed as a percentage of the total forecast: the uncertainty band spans about 61% of the point estimate, indicating substantial forecast uncertainty. Let's visualize it with a plot:
import matplotlib.pyplot as plt
# Extract week numbers and corresponding sales estimates
week_numbers = best_13_weeks_q1['WEEK_NUMBER']
sales_estimates = best_13_weeks_q1['yhat']
lower_bounds = best_13_weeks_q1['yhat_lower']
upper_bounds = best_13_weeks_q1['yhat_upper']
# Plot the time series graph with confidence interval
plt.figure(figsize=(12, 6))
plt.plot(week_numbers, sales_estimates, color='blue', label='Sales Estimate')
plt.fill_between(week_numbers, lower_bounds, upper_bounds, color='skyblue', alpha=0.4, label='Confidence Interval')
plt.xlabel('Week Number')
plt.ylabel('Total Sales')
plt.title('Week-wise Distribution of Total Sales with Confidence Interval')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
Considerations: For the demand-specific questions, we neglect package and market key, as they do not add value but simply increase complexity. Moreover, the package type given in question 1 is a completely new one in the entire dataset and cannot be extrapolated; even nearest-neighbor methods would not be of use.
Having said that, let's compare the sales statistics of category and brand for this innovative product with what we have after applying the previous filters. This comparative analysis will offer insights into potential adjustments to the sales proportions.
plum_swire_diet['CATEGORY'].value_counts(normalize=True)
ENERGY             0.998592
SPARKLING WATER    0.001126
SSD                0.000282
Name: CATEGORY, dtype: float64
Since nearly all the data we used for the forecast comes from the Energy category while this plum innovative product belongs to the SSD category, let's compare how the unit sales fared for these two categories in the entire dataset. As computed below, SSD accounts for about 69.2% of the combined SSD + Energy unit sales.
# Filter DataFrame for 'SSD' category
ssd_sales = market_demand[market_demand['CATEGORY'] == 'SSD']['UNIT_SALES'].sum()
# Filter DataFrame for 'ENERGY' category
energy_sales = market_demand[market_demand['CATEGORY'] == 'ENERGY']['UNIT_SALES'].sum()
print("Share of unit sales for SSD category out of SSD + ENERGY:", ssd_sales/(energy_sales+ssd_sales))
print("Share of unit sales for ENERGY category out of SSD + ENERGY:", energy_sales/(energy_sales+ssd_sales))
Share of unit sales for SSD category out of SSD + ENERGY: 0.6923682683459191
Share of unit sales for ENERGY category out of SSD + ENERGY: 0.30763173165408086
import pandas as pd
# Summary statistics for the 'SSD' category
ssd_data = market_demand[market_demand['CATEGORY'] == 'SSD']['UNIT_SALES'].describe()
# Summary statistics for the 'ENERGY' category
energy_data = market_demand[market_demand['CATEGORY'] == 'ENERGY']['UNIT_SALES'].describe()
# Concatenate the DataFrames horizontally
combined_data = pd.concat([ssd_data, energy_data], axis=1)
combined_data.columns = ['SSD', 'ENERGY'] # Rename columns for clarity
combined_data
| SSD | ENERGY | |
|---|---|---|
| count | 1.291228e+07 | 5.932087e+06 |
| mean | 2.088707e+02 | 2.020072e+02 |
| std | 1.026061e+03 | 8.425682e+02 |
| min | 4.000000e-02 | 4.000000e-02 |
| 25% | 1.500000e+01 | 7.000000e+00 |
| 50% | 5.300000e+01 | 3.300000e+01 |
| 75% | 1.560000e+02 | 1.370000e+02 |
| max | 9.677600e+04 | 7.165700e+04 |
From the above three results, we see the relationship between the SSD and ENERGY categories. About 68.5% of the combined records come from SSD, and, notably, total unit sales split in nearly the same proportion (about 69%). The statistics above also show that mean unit sales are nearly identical for both categories (around 200).
Given all this, we can reasonably assume SSD behaves much like ENERGY here, so there is no need to change our sales estimates.
Now let's look at Brand:
plum_swire_diet['BRAND'].value_counts(normalize=True)
VENOMOUS BLAST 0.998592 CUPADA ARID 0.001126 DIET MITE PURE ZERO 0.000282 Name: BRAND, dtype: float64
Nearly all the sales in the data used for forecasting come from the VENOMOUS BLAST brand. Let's compare its sales with those of the brand in the question, DIET SMASH:
# Total unit sales for the VENOMOUS BLAST brand
vb_brand_sales = market_demand[market_demand['BRAND']=='VENOMOUS BLAST']['UNIT_SALES'].sum()
# Total unit sales for the DIET SMASH brand
dietsmash_sales = market_demand[market_demand['BRAND']=='DIET SMASH']['UNIT_SALES'].sum()
print(" % of unit sales for DIET SMASH Brand compared to other two:", dietsmash_sales/(vb_brand_sales+dietsmash_sales))
print(" % of unit sales for VENOMOUS BLAST Brand compared to other two:", vb_brand_sales/(vb_brand_sales+dietsmash_sales))
% of unit sales for DIET SMASH Brand compared to other two: 0.11897748583594121 % of unit sales for VENOMOUS BLAST Brand compared to other two: 0.8810225141640589
# Summary statistics for each brand's unit sales
vb_data = market_demand[market_demand['BRAND']=='VENOMOUS BLAST']['UNIT_SALES'].describe()
dietsmash_data = market_demand[market_demand['BRAND']=='DIET SMASH']['UNIT_SALES'].describe()
# Concatenate the DataFrames horizontally
combined_data = pd.concat([dietsmash_data, vb_data], axis=1)
combined_data.columns = ['DIET SMASH', 'VENOMOUS BLAST'] # Rename columns for clarity
combined_data
| DIET SMASH | VENOMOUS BLAST | |
|---|---|---|
| count | 17483.000000 | 51756.000000 |
| mean | 27.975093 | 69.975983 |
| std | 30.263642 | 246.676148 |
| min | 1.000000 | 1.000000 |
| 25% | 9.000000 | 6.000000 |
| 50% | 21.000000 | 15.000000 |
| 75% | 38.000000 | 40.250000 |
| max | 698.000000 | 3701.000000 |
There are large differences in the standard deviation, mean, and maximum values. Outliers in the data appear to be pushing the mean higher for the VB brand.
Presence of potential outliers in the VENOMOUS BLAST brand:
The plots for both brands (below) clearly show several very large purchase orders, i.e., potential outliers, for the VB brand. However, the 25%, 50%, and even 75% quartile values in the table above are quite similar. These outliers have pushed average unit sales higher for VB.
Let's dive into these outlying points:
import matplotlib.pyplot as plt
# Filter the data for 'VENOMOUS BLAST' and 'DIET SMASH' brands
venomous_blast_data = market_demand[market_demand['BRAND'] == 'VENOMOUS BLAST']
diet_smash_data = market_demand[market_demand['BRAND'] == 'DIET SMASH']
# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
# Plot for 'VENOMOUS BLAST' brand
axes[0].hist(venomous_blast_data['UNIT_SALES'], bins=30, color='skyblue', edgecolor='black')
axes[0].set_title('Unit Sales Distribution for VENOMOUS BLAST Brand')
axes[0].set_xlabel('Unit Sales')
axes[0].set_ylabel('Frequency')
axes[0].grid(True)
# Plot for 'DIET SMASH' brand
axes[1].hist(diet_smash_data['UNIT_SALES'], bins=30, color='lightgreen', edgecolor='black')
axes[1].set_title('Unit Sales Distribution for DIET SMASH Brand')
axes[1].set_xlabel('Unit Sales')
axes[1].set_ylabel('Frequency')
axes[1].grid(True)
# Adjust layout
plt.tight_layout()
plt.show()
The plot confirms the presence of outliers. Diving deeper:
import numpy as np
venomous_blast_data = market_demand[market_demand['BRAND'] == 'VENOMOUS BLAST']
# percentile values
percentiles = np.arange(75, 96, 5) # Percentiles from 75 to 95 in steps of 5
percentiles = np.append(percentiles, np.arange(96, 101, 1)) # Percentiles from 96 to 100 in steps of 1
percentile_values = np.percentile(venomous_blast_data['UNIT_SALES'], percentiles)
percentile_table = pd.DataFrame({'Percentile': percentiles, 'Value': percentile_values})
percentile_table
| Percentile | Value | |
|---|---|---|
| 0 | 75 | 40.25 |
| 1 | 80 | 52.00 |
| 2 | 85 | 72.00 |
| 3 | 90 | 115.00 |
| 4 | 95 | 231.00 |
| 5 | 96 | 299.00 |
| 6 | 97 | 402.00 |
| 7 | 98 | 594.90 |
| 8 | 99 | 1612.80 |
| 9 | 100 | 3701.00 |
The VB percentile table above shows that up to the 98th percentile, the values track closely with DIET SMASH, whose maximum is about 700. The 99th and 100th percentile values of roughly 1,600 and 3,700, however, are extreme in the scope of this dataset.
In this context, these outliers account for VB's higher mean, so we will proceed on the assumption that average sales are more or less the same for both brands.
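To sanity-check that assumption, one can cap extreme orders and compare the trimmed mean against the raw mean. A minimal sketch on synthetic data (the distribution parameters below are invented for illustration, not taken from the dataset):

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-in for VB-style unit sales: many modest orders plus a few huge ones
sales = np.concatenate([rng.exponential(scale=40, size=980),
                        rng.uniform(1500, 3700, size=20)])
cap = np.percentile(sales, 98)           # winsorize at the 98th percentile
clipped = np.clip(sales, None, cap)
# The handful of extreme orders inflates the raw mean well above the capped mean
print(round(sales.mean(), 1), round(clipped.mean(), 1))
```

Capping at the 98th percentile removes exactly the region where VB and DIET SMASH diverge, which is why comparing trimmed means supports treating the two brands' typical sales as similar.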
The model's MAPE over the test set is ..
From the forecast obtained from the model, we observe that the best 13 weeks are week 0 through week 12, where week 0 starts from the first week of December (the earliest date in the dataset is treated as the first week).
The forecast for these best 13 weeks totals about 9,060 unit sales.
Using the historic-ratio method, we analyzed and compared the remaining product attributes and found little difference. Therefore, the forecasted sales figure of 9,060 remains the final result for this innovative product.
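The historic-ratio adjustment itself can be sketched as follows; the function name and example figures other than the 9,060-unit forecast are hypothetical, chosen only to illustrate the scaling:

```python
def historic_ratio_adjust(proxy_forecast, target_attr_sales, proxy_attr_sales):
    """Scale a proxy product's forecast by the historical sales ratio between
    the target attribute value and the proxy attribute value."""
    return proxy_forecast * (target_attr_sales / proxy_attr_sales)

# An attribute with materially different history would rescale the forecast:
historic_ratio_adjust(9060, target_attr_sales=50, proxy_attr_sales=100)   # -> 4530.0
# Here the comparisons showed no material difference, so the ratio is ~1
# and the 9,060-unit figure stands:
historic_ratio_adjust(9060, target_attr_sales=100, proxy_attr_sales=100)  # -> 9060.0
```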
Item Description: Sparkling Jacceptabletlester Avocado 11Small MLT
Swire plans to release this product 2 weeks prior to Easter and 2 weeks post Easter. What will the forecasted demand be, in weeks, for this product?
avocado = market_demand[(market_demand['ITEM'].str.contains('AVOCADO', case=False, regex=True))]
avocado_regular = avocado[(avocado['CALORIC_SEGMENT'].str.contains('REGULAR', case=False, regex=True, na=False))]
avocado_regular_ssd = avocado_regular[(avocado_regular['CATEGORY'].str.contains('SSD', case=False, regex=True))]
avocado_regular_ssd_swire = avocado_regular_ssd[(avocado_regular_ssd['MANUFACTURER'].str.contains('SWIRE', case=False, regex=True))]
avocado_regular_ssd_swire_small = avocado_regular_ssd_swire[(avocado_regular_ssd_swire['PACKAGE'].str.contains('SMALL', case=False, regex=True))]
We are able to find the avocado flavor in the data, and we filter further by caloric segment, market category, and manufacturer. We filter the package type to 'Small', which is the closest match available given that the product's packaging type is new.
df = avocado_regular_ssd_swire_small[['DATE','UNIT_SALES']].groupby(by="DATE", as_index=False).sum()
df['DATE'] = pd.to_datetime(df['DATE'])
df.set_index('DATE', inplace=True)
df = df.asfreq('W-SAT')
plt.plot(df)
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
Text(0, 0.5, 'UNIT_SALES')
There are NaNs in some weeks; digging deeper, we see there were no sales during August 2023.
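One common way to handle such gaps before modeling is time-based interpolation; here is a hedged sketch on a toy weekly series with a four-week gap mimicking the missing August 2023 weeks (the values are invented; the notebook itself proceeds with the NaNs left in place):

```python
import numpy as np
import pandas as pd

# Toy weekly series (W-SAT frequency) with a gap covering the August weeks
idx = pd.date_range('2023-07-01', periods=10, freq='W-SAT')
s = pd.Series([50.0, 52, 49, 51, 48, np.nan, np.nan, np.nan, np.nan, 50], index=idx)
print(s.isna().sum())                    # -> 4 missing weeks
filled = s.interpolate(method='time')    # time-weighted linear interpolation
print(filled.isna().sum())               # -> 0
```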
Before building models, let's set up validation by creating train and test sets:
# Train and Test periods
from datetime import datetime
s_date = '2023-02-25'
e_date = '2023-08-26'
df = df.loc[df.index <= e_date]
train_period = df.loc[df.index < s_date]
test_period = df.loc[(df.index >= s_date) & (df.index <= e_date)]
start_forecast = datetime.strptime(s_date, '%Y-%m-%d').date()
end_forecast = datetime.strptime(e_date, '%Y-%m-%d').date()
easter_start_2023 = '2023-04-01'
easter_end_2023 = '2023-04-22'
easter_weeks_test = df.loc[(df.index >= easter_start_2023) & (df.index <= easter_end_2023)]
### Modelling Process
We will create and reuse this function to measure each model's performance on the data. It reports MAE, MSE, RMSE, MAPE, and a direct accuracy percentage.
from sklearn.metrics import mean_absolute_error, mean_squared_error
from math import sqrt
import numpy as np
def get_scores(actual, predicted):
# Calculate errors
mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
rmse = sqrt(mse)
percentage_diff = np.abs((actual - predicted) / actual) * 100
# Calculate MAPE
mape = percentage_diff.mean()
# Calculate "Accuracy" Percentage
accuracy_percentage = 100 - mape
# Print metrics
print(f"MAE: {mae}")
print(f"MSE: {mse}")
print(f"RMSE: {rmse}")
print(f"MAPE: {mape}%")
print(f"Direct 'Accuracy' Percentage: {accuracy_percentage}%")
return pd.Series(data={'mae':mae, 'mse':mse, 'rmse':rmse, 'mape':mape, 'direct_accuracy':accuracy_percentage}, index=['mae', 'mse', 'rmse','mape','direct_accuracy'])
ARIMA is a statistical model used for time series analysis to forecast future data points by leveraging past data. It combines three main aspects: autoregression (AR), differencing (I) to make the time series stationary, and moving average (MA). The AR part exploits the relationship between an observation and a number of lagged observations, the I part involves differencing the data to achieve stationarity, and the MA part models the error of the observation as a combination of previous error terms.
from statsmodels.tsa.arima.model import ARIMA
# ARIMA Model fitting
order = (1, 1, 1) # (p, d, q)
model = ARIMA(train_period['UNIT_SALES'], order=order)
model_fit = model.fit()
# Forecasting
forecast = model_fit.predict(start=start_forecast,
end=end_forecast)
arima_forecast = forecast
arima_score = get_scores(test_period.squeeze(), arima_forecast)
MAE: 13878.577969407064 MSE: 333660932.83898854 RMSE: 18266.3880622029 MAPE: 13.029688220376284% Direct 'Accuracy' Percentage: 86.97031177962371%
Metrics for Easter weeks:
arima_easter_score = get_scores(easter_weeks_test.squeeze(), arima_forecast.loc[(arima_forecast.index >= easter_start_2023) & (arima_forecast.index <= easter_end_2023)])
MAE: 16694.848477498203 MSE: 332614641.0282627 RMSE: 18237.72576359955 MAPE: 15.726140646868652% Direct 'Accuracy' Percentage: 84.27385935313134%
# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(easter_weeks_test['UNIT_SALES'], label='Actual Sales')
plt.plot(arima_forecast.loc[(arima_forecast.index >= easter_start_2023) & (arima_forecast.index <= easter_end_2023)], label='Forecast', color='red')
plt.xticks(easter_weeks_test.index)
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
plt.legend()
plt.show()
The plot above shows actual vs. forecast sales for the Easter weeks. Since the time period is short, ARIMA appears unable to capture the details.
SARIMA extends ARIMA by explicitly accommodating and modeling seasonal effects in time series data. It includes additional seasonal elements on top of the AR, I, and MA components. SARIMA is characterized by its ability to model both non-seasonal and seasonal components of the time series data, making it more versatile than ARIMA for data with clear seasonal patterns, such as sales data around specific holidays or events. It incorporates additional parameters to handle seasonality, which are seasonal AR, seasonal differencing, and seasonal MA components, allowing it to capture seasonal fluctuations effectively, making it ideal for products with seasonal demand.
from statsmodels.tsa.statespace.sarimax import SARIMAX
# SARIMA Model fitting
order = (1, 1, 1) # (p, d, q)
seasonal_order = (1, 1, 1, 52) # (P, D, Q, s)
model = SARIMAX(train_period['UNIT_SALES'], order=order, seasonal_order=seasonal_order)
model_fit = model.fit()
# Forecasting
forecast = model_fit.predict(start=start_forecast.strftime('%Y-%m-%d'),
end=end_forecast.strftime('%Y-%m-%d'))
sarima_forecast = forecast
sarima_score = get_scores(test_period.squeeze(), sarima_forecast)
MAE: 10543.398089743641 MSE: 170655919.02185866 RMSE: 13063.533940777996 MAPE: 9.790936856842896% Direct 'Accuracy' Percentage: 90.20906314315711%
Metrics for Easter Weeks:
sarima_easter_score = get_scores(easter_weeks_test.squeeze(), sarima_forecast.loc[(sarima_forecast.index >= easter_start_2023) & (sarima_forecast.index <= easter_end_2023)])
MAE: 19749.376548111453 MSE: 497937909.1387423 RMSE: 22314.522382043993 MAPE: 19.265721342075864% Direct 'Accuracy' Percentage: 80.73427865792414%
# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(easter_weeks_test['UNIT_SALES'], label='Actual Sales')
plt.plot(sarima_forecast.loc[(sarima_forecast.index >= easter_start_2023) & (sarima_forecast.index <= easter_end_2023)], label='Forecast', color='red')
plt.xticks(easter_weeks_test.index)
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
plt.legend()
plt.show()
SARIMA appears to track the details of the actual unit-sales changes better than ARIMA.
Prophet is a forecasting tool designed by Facebook for handling time series data that displays patterns on different time scales such as yearly, weekly, and daily. It is especially useful for data with strong seasonal effects and several seasons of historical data. Prophet works by fitting nonlinear trends with yearly, weekly, and daily seasonality, plus holiday effects. It is robust to missing data and shifts in the trend, and typically requires no manual tuning of parameters. The model accommodates seasonality through Fourier series and includes components for holidays and special events, making it well-suited for predicting demand for products around specific events or holidays, like Easter.
from prophet import Prophet
df_mod = train_period.reset_index()
df_mod['DATE'] = pd.to_datetime(df_mod['DATE'])
df_mod.columns = ['ds', 'y']
model = Prophet(changepoint_prior_scale=0.1,seasonality_prior_scale=0.5, weekly_seasonality=True)
model.fit(df_mod)
# Forecast the next 52 weeks (1 year)
future = model.make_future_dataframe(periods=53, freq='W-SAT')
forecast = model.predict(future)
prophet_forecast = forecast[['ds','yhat']].copy()  # .copy() avoids SettingWithCopyWarning
prophet_forecast['ds'] = pd.to_datetime(prophet_forecast['ds'])
prophet_forecast.columns = ['DATE', 'yhat']
prophet_forecast.set_index('DATE', inplace=True)
prophet_forecast = prophet_forecast.asfreq('W-SAT')
prophet_forecast = prophet_forecast.loc[(prophet_forecast.index >= s_date) & (prophet_forecast.index <= e_date)]
prophet_score = get_scores(test_period.squeeze(), prophet_forecast.loc[(prophet_forecast.index >= s_date) & (prophet_forecast.index <= e_date)].squeeze())
MAE: 10354.142149221083 MSE: 178938113.5920231 RMSE: 13376.775156666987 MAPE: 9.585562452804588% Direct 'Accuracy' Percentage: 90.4144375471954%
Easter week metrics
prophet_easter_score = get_scores(easter_weeks_test.squeeze(), prophet_forecast.loc[(prophet_forecast.index >= easter_start_2023) & (prophet_forecast.index <= easter_end_2023)].squeeze())
MAE: 13672.249500615759 MSE: 250724046.9852324 RMSE: 15834.268122816173 MAPE: 13.398864372818881% Direct 'Accuracy' Percentage: 86.60113562718112%
plt.figure(figsize=(10, 6))
sns.lineplot(data=easter_weeks_test, x=easter_weeks_test.index, y='UNIT_SALES', label='Actual Sales')
sns.lineplot(data=prophet_forecast.loc[(prophet_forecast.index >= easter_start_2023) & (prophet_forecast.index <= easter_end_2023)], x=prophet_forecast.loc[(prophet_forecast.index >= easter_start_2023) & (prophet_forecast.index <= easter_end_2023)].index, y='yhat', label='Forecast', color='red')
plt.xticks(easter_weeks_test.index)
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
plt.legend()
plt.show()
Exponential Smoothing is a time series forecasting method for univariate data that applies smoothing factors to the observations, giving more weight to recent observations while not discarding older observations entirely. It encompasses simple exponential smoothing for data with no clear trend or seasonality, and extends to Holt’s linear trend model and Holt-Winters’ seasonal model, which can account for both trends and seasonality in the data. This method is straightforward and computationally efficient, making it a good choice for producing quick forecasts in situations where data patterns are reasonably consistent over time, but may struggle with data that has complex patterns or significant irregularities.
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# ExponentialSmoothing Model fitting
model = ExponentialSmoothing(train_period['UNIT_SALES'], trend='add', seasonal='add', seasonal_periods=52).fit(smoothing_level=0.6, smoothing_trend=0.2, smoothing_seasonal=0.2)  # `smoothing_trend` was formerly named `smoothing_slope`
# Forecasting
forecast_periods = ((end_forecast - start_forecast).days // 7) + 1
exponential_forecast = model.forecast(forecast_periods)
forecast_dates = pd.date_range(start=start_forecast, periods=forecast_periods, freq='W-SAT')
exponential_smoothing_score = get_scores(test_period.squeeze(), exponential_forecast)
MAE: 11515.003782552438 MSE: 176593000.4153336 RMSE: 13288.829911445688 MAPE: 11.473932683680282% Direct 'Accuracy' Percentage: 88.52606731631971%
Easter week metrics:
exponential_smoothing_easter_score = get_scores(easter_weeks_test.squeeze(), exponential_forecast.loc[(exponential_forecast.index >= easter_start_2023) & (exponential_forecast.index <= easter_end_2023)])
MAE: 15194.8459825914 MSE: 339963988.23290277 RMSE: 18438.112382586856 MAPE: 15.957019954688251% Direct 'Accuracy' Percentage: 84.04298004531175%
# Plotting the results
easter_start_2023 = pd.to_datetime(easter_start_2023)
easter_end_2023 = pd.to_datetime(easter_end_2023)
# Filter the dates
filtered_dates = forecast_dates[(forecast_dates >= easter_start_2023) & (forecast_dates <= easter_end_2023)]
plt.figure(figsize=(10, 6))
plt.plot(easter_weeks_test.index, easter_weeks_test['UNIT_SALES'], label='Actual Sales')
plt.plot(filtered_dates, exponential_forecast.loc[(exponential_forecast.index >= easter_start_2023) & (exponential_forecast.index <= easter_end_2023)], label='Forecast', color='red')
plt.xticks(easter_weeks_test.index)
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
plt.legend()
plt.show()
pd.options.display.float_format = '{:.2f}'.format
q2_scores = pd.DataFrame({'Arima':arima_score, 'Sarima':sarima_score, 'Prophet':prophet_score, 'Exponential Smoothing':exponential_smoothing_score}).T
print(q2_scores)
mae mse rmse mape direct_accuracy Arima 13878.58 333660932.84 18266.39 13.03 86.97 Sarima 10543.40 170655919.02 13063.53 9.79 90.21 Prophet 10354.14 178938113.59 13376.78 9.59 90.41 Exponential Smoothing 11515.00 176593000.42 13288.83 11.47 88.53
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(18, 10))
fig.suptitle('Model Performance Metrics')
for i, column in enumerate(q2_scores.columns):
row, col = divmod(i, 3)
sns.barplot(ax=axes[row, col], x=q2_scores.index, y=q2_scores[column])
axes[row, col].set_title(column)
axes[row, col].set_ylabel('')
# Adjust layout
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
Based on the detailed metrics provided, the Prophet model is identified as the most fitting choice for this context. Notably, while its MAE (Mean Absolute Error) and RMSE (Root Mean Squared Error) values are slightly higher than those of the SARIMA model, the differences are marginal. This slight increase in error metrics is counterbalanced by Prophet's MAPE (Mean Absolute Percentage Error) of 9.59, which is the lowest among the models evaluated, indicating its predictions are proportionately closer to actual values, making it highly effective in scenarios where percentage errors are more impactful than absolute errors. Moreover, Prophet's direct accuracy rate of 90.41% is the highest, signifying its unmatched precision in predicting future values directly. This level of accuracy is particularly valuable in applications where correct directional forecasting is crucial. Furthermore, Prophet's flexibility in handling seasonality and missing data, along with its ability to incorporate external variables, offers a robust modeling choice that adapts well to complex datasets. Thus, considering the balance between nuanced accuracy, error margin, and the ability to handle real-world data complexities, Prophet emerges as the superior model for forecasting tasks where adaptability and precision are paramount.
Let's take a look at how good each model is at predicting Easter 2023.
q2_easter_scores = pd.DataFrame({'Arima':arima_easter_score, 'Sarima':sarima_easter_score, 'Prophet':prophet_easter_score, 'Exponential Smoothing':exponential_smoothing_easter_score}).T
print(q2_easter_scores)
mae mse rmse mape direct_accuracy Arima 16694.85 332614641.03 18237.73 15.73 84.27 Sarima 19749.38 497937909.14 22314.52 19.27 80.73 Prophet 13672.25 250724046.99 15834.27 13.40 86.60 Exponential Smoothing 15194.85 339963988.23 18438.11 15.96 84.04
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(18, 10))
fig.suptitle('Model Performance Metrics')
for i, column in enumerate(q2_easter_scores.columns):
row, col = divmod(i, 3)
sns.barplot(ax=axes[row, col], x=q2_easter_scores.index, y=q2_easter_scores[column])  # fixed: plot the Easter scores, not q2_scores
axes[row, col].set_title(column)
axes[row, col].set_ylabel('')
# Adjust layout
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
Given the updated metrics, the Prophet model distinctly emerges as the premier choice for forecasting Easter 2023. This preference is underscored by its superior Mean Absolute Error (MAE) of 13672.25, which is the lowest among the models, indicating that its average deviation from actual values is minimal. This is particularly significant in forecasting scenarios where precision in predictions is paramount. The Prophet model also boasts a commendable Root Mean Squared Error (RMSE) of 15834.27, which, despite not being the absolute lowest, reflects its robustness in handling both small and large prediction errors effectively. Furthermore, its Mean Absolute Percentage Error (MAPE) stands at 13.40, underscoring its efficiency in maintaining proportionate accuracy across various magnitudes of data—a critical aspect for ensuring reliability in percentage terms.
The model's Direct Accuracy rate of 86.60% is another key factor in its favor, showcasing its exceptional ability to predict the direction of trends accurately. This attribute is essential in applications where understanding the directional movement of data points is more critical than the exact values.
Prophet's standout performance is not just in its error metrics but also in its inherent features that cater well to forecasting tasks like Easter 2023. It adeptly handles seasonality and trends over time, making it exceptionally suitable for predicting events with complex patterns. Additionally, Prophet's flexibility in integrating holidays and special events into its forecasts further cements its suitability for accurately predicting specific occasions like Easter, whose date moves annually.
Considering these attributes, the Prophet model offers a harmonious blend of accuracy, reliability, and adaptability, making it the most fitting choice for forecasting events with significant annual variation like Easter 2023. Its ability to deliver precise forecasts despite the complexities of seasonal patterns and holiday impacts renders it a robust tool for such predictive tasks.
df = avocado_regular_ssd_swire_small[['DATE','UNIT_SALES']].groupby(by="DATE", as_index=False).sum()
df['DATE'] = pd.to_datetime(df['DATE'])
df.set_index('DATE', inplace=True)
df = df.loc[df.index <= "2023-08-26"]
df = df.asfreq('W-SAT')
start_forecast = datetime.strptime("2024-03-23", '%Y-%m-%d').date()
end_forecast = datetime.strptime("2024-04-13", '%Y-%m-%d').date()
from prophet import Prophet
df_mod = df.reset_index()
df_mod['DATE'] = pd.to_datetime(df_mod['DATE'])
df_mod.columns = ['ds', 'y']
model = Prophet(changepoint_prior_scale=0.1,seasonality_prior_scale=0.5, weekly_seasonality=True)
model.fit(df_mod)
# Forecast the next 52 weeks (1 year)
future = model.make_future_dataframe(periods=53, freq='W-SAT')
forecast = model.predict(future)
prophet_forecast = forecast[['ds','yhat','yhat_upper','yhat_lower']].copy()  # .copy() avoids SettingWithCopyWarning
prophet_forecast['ds'] = pd.to_datetime(prophet_forecast['ds'])
prophet_forecast.columns = ['DATE', 'yhat','yhat_upper','yhat_lower']
prophet_forecast.set_index('DATE', inplace=True)
prophet_forecast = prophet_forecast.asfreq('W-SAT')
prophet_forecast = prophet_forecast.loc[(prophet_forecast.index >= "2024-03-23") & (prophet_forecast.index <= "2024-04-13")]
prophet_forecast
| yhat | yhat_upper | yhat_lower | |
|---|---|---|---|
| DATE | |||
| 2024-03-23 | 93763.73 | 110608.42 | 75509.07 |
| 2024-03-30 | 98443.33 | 116404.55 | 81232.11 |
| 2024-04-06 | 108617.99 | 126353.05 | 92715.61 |
| 2024-04-13 | 115180.50 | 130721.01 | 97186.91 |
The values in the yhat column above are the forecasted unit sales for the Easter weeks.
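Summing the four weekly point forecasts from the table gives the total forecasted demand for the Easter window (the values below are copied from the yhat column above):

```python
# Weekly point forecasts (yhat) for the 2024 Easter window, from the table above
weekly_yhat = {
    '2024-03-23': 93763.73,
    '2024-03-30': 98443.33,
    '2024-04-06': 108617.99,
    '2024-04-13': 115180.50,
}
total_easter_demand = sum(weekly_yhat.values())
print(round(total_easter_demand, 2))  # -> 416005.55
```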
plt.figure(figsize=(10, 6))
plt.plot(prophet_forecast.index, prophet_forecast['yhat'], label='Predicted', color='blue', marker='o')
plt.fill_between(prophet_forecast.index, prophet_forecast['yhat_lower'], prophet_forecast['yhat_upper'], color='gray', alpha=0.2, label='Confidence Interval')
plt.xticks(prophet_forecast.index)
plt.title('Forecast with Confidence Interval')
plt.xlabel('Date')
plt.ylabel('Forecasted UNIT_SALES')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
Given the new data and insights, our exploration into forecasting the demand for Swire's innovative product, the Sparkling Jacceptabletlester Avocado 11Small MLT around Easter 2024, was meticulously conducted using advanced time series analysis. The process commenced with a thorough data preparation phase, where the market demand dataset was strategically filtered to hone in on relevant parameters: avocado flavor, regular caloric segment, SSD market category, Swire-CC as the manufacturer, and a focus on small package types. This precise data curation was pivotal in aligning the analysis with the specific product profile of interest, ensuring a targeted approach to forecasting.
Upon evaluation, the forecasting models—ARIMA, SARIMA, Prophet, and Exponential Smoothing—underwent rigorous performance assessment through a suite of metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), and Direct 'Accuracy' Percentage. These metrics served as critical indicators of each model's predictive accuracy and reliability, essential for identifying the most suitable forecasting tool for the task at hand.
The performance summary revealed a standout model, underscoring the dynamic nature of forecasting methodologies: the Prophet model demonstrated exceptional prowess, outperforming the others on key metrics. Its predictive capability shows not just in lower MAE and RMSE values but also in its superior MAPE and direct accuracy percentages. These strengths indicate that Prophet's predictions are not only close to actual sales data but also consistently reliable across different forecasting scenarios.
Interestingly, while the Exponential Smoothing model showed promise in the initial analysis, a reevaluation based on the latest insights pointed towards the Prophet model's unmatched efficiency and adaptability, particularly in handling seasonal variations around Easter 2024. The Prophet model's ability to incorporate seasonal trends, holidays, and other external variables makes it exceptionally suitable for forecasting demand for products like the Sparkling Jacceptabletlester Avocado 11Small MLT, which likely sees fluctuating demand due to seasonal events such as Easter.
The forecast from the Prophet model suggests a nuanced understanding of demand dynamics around Easter 2024, projecting an upswing in sales that peaks in the week leading up to Easter, before moderating in the following weeks. This pattern offers Swire invaluable strategic insights, enabling the company to optimize production, distribution, and marketing efforts to fully capitalize on the anticipated demand surge.
In conclusion, the Prophet model, with its robust handling of complex seasonal patterns and its adeptness in forecasting under conditions of uncertainty, stands out as the superior choice for predicting the demand for Swire's new avocado-flavored beverage around Easter 2024. The insights derived from this model provide a solid foundation for informed decision-making, ensuring that Swire can navigate the seasonal market dynamics effectively and maximize the product's success in the competitive beverage landscape.
Item Description: Diet Venomous Blast Energy Drink Kiwano 16 Liquid Small
Which 13 weeks of the year would this product perform best in the market? What is the forecasted demand, in weeks, for those 13 weeks?
# filtering for kiwano
kiwano = market_demand[(market_demand['ITEM'].str.contains('KIWANO', case=False, regex=True))]
# Filtering Swire + kiwano data
kiwano_swire = kiwano[kiwano['MANUFACTURER'] == 'SWIRE-CC']
# Filtering Swire +kiwano + Diet
kiwano_swire_diet = kiwano_swire[kiwano_swire['CALORIC_SEGMENT'] == 'DIET/LIGHT']
Grouping by date to aggregate daily sales data:
# Ensure the 'DATE' column is in datetime format
kiwano_swire_diet['DATE'] = pd.to_datetime(kiwano_swire_diet['DATE'])
kiwano_swire_diet_sorted = kiwano_swire_diet.sort_values('DATE')
# Aggregate by date and compute total unit sales per day
kiwano_swire_diet_sorted_daily = kiwano_swire_diet_sorted.groupby('DATE')['UNIT_SALES'].sum().reset_index()
Adding week number column:
earliest_date = kiwano_swire_diet_sorted_daily['DATE'].min()
# Step 2: Calculate week numbers
kiwano_swire_diet_sorted_daily['WEEK_NUMBER'] = ((kiwano_swire_diet_sorted_daily['DATE'] - earliest_date).dt.days // 7) % 52
# Step 3: Aggregate the data for every week
weekly_data_kiwano = kiwano_swire_diet_sorted_daily.groupby('WEEK_NUMBER')['UNIT_SALES'].sum().reset_index()
Aggregating data by a 13 week interval and plotting the rolling 13 week sum sales:
import pandas as pd
# weekly_data_kiwano contains 'WEEK_NUMBER' and 'UNIT_SALES' columns
# Sort the DataFrame by 'WEEK_NUMBER' if it's not already sorted
weekly_data_kiwano = weekly_data_kiwano.sort_values(by='WEEK_NUMBER')
# Ensure that there are enough rows to compute the rolling sum
if len(weekly_data_kiwano) < 13:
print("Insufficient data to compute rolling sum.")
else:
# Initialize an empty list to store the results
rolling_sales_sum = []
# Iterate over the DataFrame to calculate the rolling sum for each 13-week interval
for i in range(len(weekly_data_kiwano) - 12):
# Calculate the sum of UNIT_SALES for the current 13-week interval
sales_sum = weekly_data_kiwano.loc[i:i+12, 'UNIT_SALES'].sum()
# Construct the 13-week interval string
interval = f"{weekly_data_kiwano.iloc[i]['WEEK_NUMBER']}-{weekly_data_kiwano.iloc[i+12]['WEEK_NUMBER']}"
# Append the result to the list
rolling_sales_sum.append((sales_sum, interval))
# Create a new DataFrame from the results
rolling_sales_df = pd.DataFrame(rolling_sales_sum, columns=['SALES_13W_SUM', '13_WEEK_INTERVAL'])
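As a design note, the explicit loop above can be expressed with pandas' built-in rolling window. A minimal sketch on a stand-in series (synthetic values, not the kiwano data):

```python
import pandas as pd

weekly = pd.Series(range(1, 53))                 # stand-in for 52 weekly UNIT_SALES totals
rolling_13w = weekly.rolling(window=13).sum().dropna()
best_end = int(rolling_13w.idxmax())             # last week of the best 13-week window
best_start = best_end - 12                       # first week of that window
print(best_start, best_end, rolling_13w.max())   # -> 39 51 598.0
```

With `window=13`, each entry is the sum of that week and the 12 preceding weeks, so `rolling(...).sum()` reproduces the loop's 13-week interval sums in a single vectorized call.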
import matplotlib.pyplot as plt
# Assuming rolling_sales_df contains 'SALES_13W_SUM' and '13_WEEK_INTERVAL' columns
# Extract data for plotting
sales_sum = rolling_sales_df['SALES_13W_SUM']
interval = rolling_sales_df['13_WEEK_INTERVAL']
# Plot the rolling sum
plt.figure(figsize=(18, 6))
plt.plot(interval, sales_sum, marker='o', linestyle='-')
plt.xlabel('13-Week Interval')
plt.ylabel('Sales Sum')
plt.title('Rolling Sum of Sales over 13-Week Intervals')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.grid(True)
# Find the index of the maximum value
max_index = sales_sum.idxmax()
max_value = sales_sum[max_index]
max_interval = interval[max_index]
# Annotate the peak with a red marker and the corresponding 13-week interval
plt.annotate(f'Max: {max_value} ({max_interval})',
             xy=(max_interval, max_value),
             xytext=(70, -20),
             textcoords='offset points',
             arrowprops=dict(facecolor='red', arrowstyle='->'),
             fontsize=10,
             color='red',
             ha='center')
plt.tight_layout()
plt.show()
From the plot, we can clearly see that the maximum sales are observed from week 19 to week 31.
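As a side note, the same 13-week rolling sum can be computed more concisely with pandas' built-in rolling window. A minimal sketch with hypothetical weekly data (the `weekly` frame below stands in for `weekly_data_kiwano`):

```python
import pandas as pd

# Hypothetical weekly data standing in for weekly_data_kiwano
weekly = pd.DataFrame({
    'WEEK_NUMBER': range(1, 27),
    'UNIT_SALES': [100 + 10 * (w % 13) for w in range(1, 27)],
})

# Rolling 13-week sum; the first 12 rows are NaN and get dropped
weekly['SALES_13W_SUM'] = weekly['UNIT_SALES'].rolling(window=13).sum()
rolling = weekly.dropna(subset=['SALES_13W_SUM']).reset_index(drop=True)

# Label each interval by its start and end week
rolling['13_WEEK_INTERVAL'] = [
    f"{weekly['WEEK_NUMBER'][i]}-{weekly['WEEK_NUMBER'][i + 12]}"
    for i in rolling.index
]
print(rolling[['13_WEEK_INTERVAL', 'SALES_13W_SUM']].head())
```

This avoids the explicit loop and scales better for longer series.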
The data was split into training and testing sets in an 80/20 ratio:
from prophet import Prophet
from prophet.plot import plot_plotly, plot_components_plotly
import pandas as pd
# Calculate the index for partitioning the data
partition_index = int(len(kiwano_swire_diet_sorted_daily) * 0.8)
# Split the data into train and test sets
train_data = kiwano_swire_diet_sorted_daily.iloc[:partition_index]
test_data = kiwano_swire_diet_sorted_daily.iloc[partition_index:]
# Rename columns as per Prophet's requirements
train_data = train_data.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
test_data = test_data.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
# Print the lengths of train and test dataframes
print("Length of Train Data:", len(train_data))
print("Length of Test Data:", len(test_data))
# Print the start and end dates of train and test dataframes
print("Start Date of Train Data:", train_data['ds'].min())
print("End Date of Train Data:", train_data['ds'].max())
print("Start Date of Test Data:", test_data['ds'].min())
print("End Date of Test Data:", test_data['ds'].max())
Length of Train Data: 118
Length of Test Data: 30
Start Date of Train Data: 2020-12-05 00:00:00
End Date of Train Data: 2023-03-04 00:00:00
Start Date of Test Data: 2023-03-11 00:00:00
End Date of Test Data: 2023-10-28 00:00:00
m = Prophet()
m.fit(train_data)
pred_df = m.predict(test_data)
pred_df.head()
12:27:59 - cmdstanpy - INFO - Chain [1] start processing 12:28:00 - cmdstanpy - INFO - Chain [1] done processing
| ds | trend | yhat_lower | yhat_upper | trend_lower | trend_upper | additive_terms | additive_terms_lower | additive_terms_upper | yearly | yearly_lower | yearly_upper | multiplicative_terms | multiplicative_terms_lower | multiplicative_terms_upper | yhat | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-03-11 | 19485.394281 | 15920.064503 | 19464.224109 | 19483.843303 | 19488.275912 | -1759.759261 | -1759.759261 | -1759.759261 | -1759.759261 | -1759.759261 | -1759.759261 | 0.0 | 0.0 | 0.0 | 17725.635020 |
| 1 | 2023-03-18 | 19247.662322 | 16487.443642 | 19995.816392 | 19232.930247 | 19265.132350 | -1028.066953 | -1028.066953 | -1028.066953 | -1028.066953 | -1028.066953 | -1028.066953 | 0.0 | 0.0 | 0.0 | 18219.595369 |
| 2 | 2023-03-25 | 19009.930363 | 17453.491002 | 20805.587420 | 18976.524053 | 19051.408845 | 49.613071 | 49.613071 | 49.613071 | 49.613071 | 49.613071 | 49.613071 | 0.0 | 0.0 | 0.0 | 19059.543434 |
| 3 | 2023-04-01 | 18772.198405 | 18148.134072 | 21589.310780 | 18712.260886 | 18844.356146 | 1040.777482 | 1040.777482 | 1040.777482 | 1040.777482 | 1040.777482 | 1040.777482 | 0.0 | 0.0 | 0.0 | 19812.975887 |
| 4 | 2023-04-08 | 18534.466446 | 18501.014865 | 21900.768394 | 18440.762472 | 18635.148118 | 1679.434987 | 1679.434987 | 1679.434987 | 1679.434987 | 1679.434987 | 1679.434987 | 0.0 | 0.0 | 0.0 | 20213.901433 |
Above are the predicted results on the test data.
fig1 = m.plot(pred_df)
The black dots show the actual values, and the shaded blue region shows the forecast.
To evaluate the performance of a Prophet model for time series forecasting, we have used the following metrics to gauge its accuracy and effectiveness.
Mean Absolute Percentage Error (MAPE): MAPE measures the percentage difference between actual and predicted values, providing insight into the overall accuracy of the forecasts. It gives us a better understanding of the magnitude of errors relative to the actual values.
Mean Absolute Error (MAE): MAE measures the average magnitude of errors between predicted and actual values, regardless of their direction. It is less sensitive to outliers than squared-error metrics and aids in the interpretation of the results.
Mean Squared Error (MSE): MSE measures the average squared difference between predicted and actual values, emphasizing larger errors due to the squaring effect. MSE is widely used for assessing the goodness of fit of a model, penalizing larger errors more heavily.
Root Mean Squared Error (RMSE): RMSE is the square root of the MSE and shares its characteristics while providing results in the same units as the original data. It is a widely used metric because its results have an easy, intuitive interpretation relative to the original data.
Accuracy: defined here as 100 − MAPE, accuracy provides an intuitive indication of how well the model is performing.
All these metrics collectively offer a comprehensive assessment of the Prophet model's performance in forecasting time series data and for better decision making.
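For reference, with actuals \(y_i\), predictions \(\hat{y}_i\), and \(n\) observations, the metrics above are:

```latex
\begin{aligned}
\mathrm{MAPE} &= \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|, &
\mathrm{MAE} &= \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert, \\
\mathrm{MSE} &= \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2, &
\mathrm{RMSE} &= \sqrt{\mathrm{MSE}}, \qquad \mathrm{Accuracy} = 100 - \mathrm{MAPE}.
\end{aligned}
```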
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

def dividefn(num1, num2):
    try:
        return num1 / num2
    except ZeroDivisionError:
        return 0

def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    out = [0] * len(y_true)
    for i in np.arange(len(y_true)):
        a = y_pred[i] - y_true[i]
        b = y_true[i]
        out[i] = dividefn(a, b)
    return np.mean(np.abs(out)) * 100

def val_metrics_sales(actual, predicted):
    # Compute all validation metrics for a pair of actual/predicted series
    metric = []
    mape = np.round(mean_absolute_percentage_error(actual, predicted), 3)
    mae = np.round(mean_absolute_error(actual, predicted), 3)
    mse_error = np.round(mean_squared_error(actual, predicted), 3)
    rmse = np.round(np.sqrt(mean_squared_error(actual, predicted)), 3)
    accuracy = np.round(100 - abs(mape), 3)
    metric.append([mape, mae, mse_error, rmse, accuracy])
    return np.array(metric)

Validation_metric = val_metrics_sales(test_data['y'], pred_df['yhat'])
print("MAPE:" , Validation_metric[0][0])
print("MAE:" , Validation_metric[0][1])
print("MSE_Error:" , Validation_metric[0][2])
print("RMSE:" , Validation_metric[0][3])
print("Accuracy:" , Validation_metric[0][4])
MAPE: 20.181
MAE: 3731.182
MSE_Error: 19613458.552
RMSE: 4428.708
Accuracy: 79.819
Our model achieved roughly 80% accuracy, with a MAPE of about 20%.
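As a quick sanity check of the MAPE logic, a self-contained sketch on toy values (restating the divide-by-zero guard that `dividefn` provides above):

```python
import numpy as np

def mape(y_true, y_pred):
    """MAPE with a divide-by-zero guard, mirroring dividefn above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    safe = np.where(y_true != 0, y_true, 1.0)            # avoid division by zero
    ratios = np.where(y_true != 0, (y_pred - y_true) / safe, 0.0)
    return np.mean(np.abs(ratios)) * 100

# 10% over on the first point, 10% under on the second -> MAPE = 10
print(mape([100, 200], [110, 180]))
```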
kiwano_swire_diet_sorted_daily.tail(1)
| DATE | UNIT_SALES | WEEK_NUMBER | |
|---|---|---|---|
| 147 | 2023-10-28 | 15549.67 | 47 |
The last date is 2023-10-28
kiwano_future = kiwano_swire_diet_sorted_daily.copy()
# Renaming columns for prophet requirements
kiwano_future.rename(columns = {'DATE':'ds','UNIT_SALES':'y'}, inplace = True)
m_2 = Prophet()
m_2.fit(kiwano_future)
# Create future dates for forecasting
future_kiwano = m_2.make_future_dataframe(periods=52,freq = 'W')
# Forecasting for 1 year
future_kiwano_forecast = m_2.predict(future_kiwano)
future_kiwano_forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
12:30:42 - cmdstanpy - INFO - Chain [1] start processing 12:30:42 - cmdstanpy - INFO - Chain [1] done processing
| ds | yhat | yhat_lower | yhat_upper | |
|---|---|---|---|---|
| 195 | 2024-09-22 | 9112.684434 | 5202.013051 | 13014.200778 |
| 196 | 2024-09-29 | 8509.657160 | 4758.182902 | 12584.130980 |
| 197 | 2024-10-06 | 7670.426502 | 3965.991301 | 11509.779202 |
| 198 | 2024-10-13 | 6735.039638 | 2736.616109 | 10611.176965 |
| 199 | 2024-10-20 | 6006.095057 | 1857.985922 | 10155.324535 |
Above are the prediction results.
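For intuition, `make_future_dataframe(periods=52, freq='W')` appends 52 week-ending-Sunday dates after the history; the same 52 future dates can be generated with pandas alone (a sketch, using the last observed date from the table above):

```python
import pandas as pd

# Last observed date in the series (a Saturday)
last_date = pd.Timestamp('2023-10-28')
# freq='W' means week-ending Sunday; date_range rolls forward to the first
# Sunday on or after the start, so the horizon begins on 2023-10-29
future_dates = pd.date_range(start=last_date, periods=52, freq='W')
print(future_dates[0].date(), future_dates[-1].date())
```

The final date, 2024-10-20, matches the last row of the forecast table above.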
plot_plotly(m_2, future_kiwano_forecast)
# Filtering the predictions to get only the forecasted (future) sales;
# .copy() avoids pandas' SettingWithCopyWarning on the assignment below
forecast_future = future_kiwano_forecast[future_kiwano_forecast['ds'] > '2023-10-28'].copy()
forecast_future
# No negative sales
forecast_future['yhat'] = forecast_future['yhat'].clip(lower=0)
forecast_future_date_sales = forecast_future[['ds', 'yhat']]
forecast_future_date_sales.head(5)
| ds | yhat | |
|---|---|---|
| 148 | 2023-10-29 | 13004.568376 |
| 149 | 2023-11-05 | 12468.837648 |
| 150 | 2023-11-12 | 11637.278663 |
| 151 | 2023-11-19 | 10663.885980 |
| 152 | 2023-11-26 | 9845.748470 |
Extracted data for the 1-year forecast:
# Step 1: Find the earliest forecast date
earliest_date = forecast_future_date_sales['ds'].min()
# Step 2: Calculate week numbers (.copy() avoids SettingWithCopyWarning)
forecast_future_date_sales = forecast_future_date_sales.copy()
forecast_future_date_sales['WEEK_NUMBER'] = ((forecast_future_date_sales['ds'] - earliest_date).dt.days // 7) % 52
# Step 3: Aggregate the data for every week
weekly_data_kiwano_forecasted = forecast_future_date_sales.groupby('WEEK_NUMBER')['yhat'].sum().reset_index()
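The `% 52` in the WEEK_NUMBER calculation folds any date more than a year past the start back into the 0-51 range; a minimal illustration with hypothetical dates:

```python
import pandas as pd

df = pd.DataFrame({'ds': pd.to_datetime(['2023-10-29', '2023-11-05', '2023-11-12', '2024-11-03'])})
earliest = df['ds'].min()
# Whole weeks since the earliest date, wrapped into 0..51;
# the last date is 53 weeks out, so it wraps around to week 1
df['WEEK_NUMBER'] = ((df['ds'] - earliest).dt.days // 7) % 52
print(df['WEEK_NUMBER'].tolist())
```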
import pandas as pd
# Assuming weekly_data_kiwano_forecasted contains 'WEEK_NUMBER' and 'yhat' columns
# Sort the DataFrame by 'WEEK_NUMBER' if it's not already sorted
weekly_data_kiwano_forecasted = weekly_data_kiwano_forecasted.sort_values(by='WEEK_NUMBER').reset_index(drop=True)
# Ensure that there are enough rows to compute the rolling sum
if len(weekly_data_kiwano_forecasted) < 13:
    print("Insufficient data to compute rolling sum.")
else:
    # Initialize an empty list to store the results
    rolling_sales_sum = []
    # Iterate over the DataFrame to calculate the rolling sum for each 13-week interval
    for i in range(len(weekly_data_kiwano_forecasted) - 12):
        # Sum the forecasted sales ('yhat') for the current 13-week interval
        sales_sum = weekly_data_kiwano_forecasted.iloc[i:i + 13]['yhat'].sum()
        # Construct the 13-week interval string
        interval = f"{weekly_data_kiwano_forecasted.iloc[i]['WEEK_NUMBER']}-{weekly_data_kiwano_forecasted.iloc[i + 12]['WEEK_NUMBER']}"
        # Append the result to the list
        rolling_sales_sum.append((sales_sum, interval))
    # Create a new DataFrame from the results
    rolling_sales_df = pd.DataFrame(rolling_sales_sum, columns=['SALES_13W_SUM', '13_WEEK_INTERVAL'])
import matplotlib.pyplot as plt
# Assuming rolling_sales_df contains 'SALES_13W_SUM' and '13_WEEK_INTERVAL' columns
# Extract data for plotting
sales_sum = rolling_sales_df['SALES_13W_SUM']
interval = rolling_sales_df['13_WEEK_INTERVAL']
# Plot the rolling sum
plt.figure(figsize=(18, 6))
plt.plot(interval, sales_sum, marker='o', linestyle='-')
plt.xlabel('13-Week Interval')
plt.ylabel('Sales Sum')
plt.title('Rolling Sum of Sales over 13-Week Intervals')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.grid(True)
# Find the index of the maximum value
max_index = sales_sum.idxmax()
max_value = sales_sum[max_index]
max_interval = interval[max_index]
# Annotate the peak with a red marker and the corresponding 13-week interval
plt.annotate(f'Max: {max_value} ({max_interval})',
             xy=(max_interval, max_value),
             xytext=(70, -20),
             textcoords='offset points',
             arrowprops=dict(facecolor='red', arrowstyle='->'),
             fontsize=10,
             color='red',
             ha='center')
plt.tight_layout()
plt.show()
Our historical data show the highest sales from week 19 to week 31.
The forecasted demand range is similar to the range of the existing historical demand, which supports the validity of our results.
The best 13-week window of the forecasted year is weeks 24-36, with total unit sales of 186K.
This analysis treats Utah as the central point.
# Function to calculate the scores
from sklearn.metrics import mean_absolute_error, mean_squared_error
from math import sqrt
import numpy as np
import pandas as pd

def get_scores(actual, predicted):
    # Calculate errors
    mae = mean_absolute_error(actual, predicted)
    mse = mean_squared_error(actual, predicted)
    rmse = sqrt(mse)
    # Combine actual and predicted values into one DataFrame
    mape_df = pd.DataFrame({'actual': actual.values, 'predicted': predicted.values})
    # Calculate absolute percentage error
    mape_df['absolute_percentage_error'] = np.abs((mape_df['actual'] - mape_df['predicted']) / mape_df['actual']) * 100
    # Calculate MAPE
    mape = mape_df['absolute_percentage_error'].mean()
    # Calculate "accuracy" percentage
    accuracy_percentage = 100 - mape
    return pd.Series(data={'mae': mae, 'mse': mse, 'rmse': rmse, 'mape': mape, 'direct_accuracy': accuracy_percentage},
                     index=['mae', 'mse', 'rmse', 'mape', 'direct_accuracy'])
# Reading the zip_to_market data
zip_to_market = pd.read_csv(r"zip_to_market_unit_mapping.csv")
consumer_demographics = pd.read_csv(r"pivoted consumer data.csv")
Mapping states to region:
# Dictionary mapping states to regions
state_regions = {
    'WA': 'North West',
    'OR': 'North West',
    'ID': 'North',
    'WY': 'North',
    'AZ': 'South',
    'NM': 'South',
    'NV': 'West',
    'CA': 'West',
    'UT': 'Central',
    'CO': 'Central',
    'NE': 'East',
    'SD': 'East',
    'KS': 'East'
}
# Function to get the region for a state
def get_region(state):
    return state_regions.get(state, 'Unknown')
# Add a new column 'Region' based on 'State'
consumer_demographics['Region'] = consumer_demographics['State'].map(get_region)
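Equivalently, the dictionary could be mapped directly without the helper function; unmatched states become NaN, which `fillna` converts to 'Unknown'. A small sketch with a hypothetical subset of the mapping:

```python
import pandas as pd

# Hypothetical subset of the state_regions mapping above
state_regions = {'WA': 'North West', 'UT': 'Central', 'ID': 'North'}
states = pd.Series(['WA', 'UT', 'TX'])  # 'TX' is not in the mapping
regions = states.map(state_regions).fillna('Unknown')
print(regions.tolist())
```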
Filtering for North region:
# Selecting only Zip and Region column
consumer_demographics_zip_region = consumer_demographics[['Zip', 'Region']]
# Filtering the dataset to consider only North regions
consumer_demographics_north = consumer_demographics_zip_region[consumer_demographics_zip_region['Region'] == 'North']
Joining datasets and removing unnecessary columns:
# Perform inner join on 'Zip' and 'ZIP_CODE' columns
merged_data = pd.merge(consumer_demographics_north, zip_to_market, left_on='Zip', right_on='ZIP_CODE', how='inner')
# Remove duplicates in MARKET_KEY and Region
merged_data = merged_data.drop_duplicates(subset=['MARKET_KEY', 'Region'])
# Dropping the Zip and ZIP_CODE columns
merged_data.drop(columns=['Zip', 'ZIP_CODE'], inplace=True)
Checking null values in the merged dataset:
# Combining market_demand with zip_customer dataset
final_merged_data = pd.merge(market_demand, merged_data, on='MARKET_KEY', how='inner')
# Checking if there are null values
final_merged_data.isnull().sum().sum()
0
# Filtering the mulberry flavor from item
mulberry = final_merged_data[(final_merged_data['ITEM'].str.contains('MULBERRIES', case=False, regex=True))]
mulberry_swire = mulberry[mulberry['MANUFACTURER'] == 'SWIRE-CC']
mulberry_swire_diet = mulberry_swire[mulberry_swire['CALORIC_SEGMENT'] == 'DIET/LIGHT']
# Printing the shape of the new dataset which contains only mulberry
mulberry.shape
(12784, 11)
# Printing the shape
mulberry_swire_diet.shape
(1708, 11)
There are about 1700 records with the filters applied.
# Convert 'DATE' column to datetime format
mulberry_swire_diet['DATE'] = pd.to_datetime(mulberry_swire_diet['DATE'])
# Grouping by date column and computing unit sales to eliminate duplicates
weekly_data = mulberry_swire_diet.groupby('DATE')['UNIT_SALES'].sum().reset_index()
# Group by year and sum the UNIT_SALES
yearly_sales = weekly_data.groupby(weekly_data['DATE'].dt.year)['UNIT_SALES'].sum()
# Create a new DataFrame from the grouped data
yearly_sales_df = pd.DataFrame({'Year': yearly_sales.index, 'Unit_Sales': yearly_sales.values})
# Printing the dataframe
yearly_sales_df
| Year | Unit_Sales | |
|---|---|---|
| 0 | 2020 | 4187.0 |
| 1 | 2021 | 50501.0 |
| 2 | 2022 | 7165.0 |
| 3 | 2023 | 156.0 |
Unit sales are highest for 2021 at roughly 50K, followed by 2022 and 2020. The lowest unit sales come from 2023, accounting for a mere 156 units. There is a negative trend after 2021.
import matplotlib.pyplot as plt
# Find the peak sales year and value
peak_sales_year = yearly_sales_df.loc[yearly_sales_df['Unit_Sales'].idxmax(), 'Year']
peak_sales_value = yearly_sales_df['Unit_Sales'].max()
# Plot
plt.figure(figsize=(10, 6))
plt.plot(yearly_sales_df['Year'], yearly_sales_df['Unit_Sales'], marker='o', linestyle='-')
plt.title('Yearly Unit Sales')
plt.xlabel('Year')
plt.ylabel('Unit Sales')
plt.grid(True)
plt.xticks(yearly_sales_df['Year']) # Show all years on x-axis
# Mark the peak sales value
plt.annotate(f'Peak: {peak_sales_value} units', xy=(peak_sales_year, peak_sales_value),
             xytext=(peak_sales_year + 1, peak_sales_value + 1000),
             arrowprops=dict(facecolor='black', shrink=0.05))
plt.tight_layout()
plt.show()
There is a negative trend for mulberry + swire + diet filtered products from 2021 onwards.
# Converting Date to datetime variable
mulberry_swire_diet['DATE'] = pd.to_datetime(mulberry_swire_diet['DATE'])
# Grouping by date to ensure only one record for each date
mulberry_swire_diet = mulberry_swire_diet.groupby('DATE')['UNIT_SALES'].sum().reset_index()
Training on the entire dataset for forecasting:
# Extracting mulberry flavor
mulberry = final_merged_data[(final_merged_data['ITEM'].str.contains('MULBERRIES', case=False, regex=True))]
# Filtering out the Swire sales
mulberry_swire = mulberry[mulberry['MANUFACTURER'] == 'SWIRE-CC']
# Filtering out swire+mulberry+diet
mulberry_swire_diet = mulberry_swire[mulberry_swire['CALORIC_SEGMENT'] == 'DIET/LIGHT']
# Copying the dataset
mulberry_swire_diet_forecast = mulberry_swire_diet.copy()
# Converting to date time
mulberry_swire_diet_forecast['DATE'] = pd.to_datetime(mulberry_swire_diet_forecast['DATE'])
# Grouping to ensure one record for each date
mulberry_swire_diet_forecast_daily = mulberry_swire_diet_forecast.groupby('DATE')['UNIT_SALES'].sum().reset_index()
# Sorting by date column
mulberry_swire_diet_forecast = mulberry_swire_diet_forecast_daily.sort_values(by='DATE')
Data preparation for Prophet:
# Renaming for Prophet model requirements
mulberry_swire_diet_forecast = mulberry_swire_diet_forecast.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
# Extracting the last date
mulberry_swire_diet_forecast.tail(1)
| ds | y | |
|---|---|---|
| 134 | 2023-07-01 | 1.0 |
The last date is 2023-07-01. Building the model:
# Creating an instance for the model
p_model1 = Prophet()
# Fitting the model
p_model1.fit(mulberry_swire_diet_forecast)
# Create future dates for forecasting for 1 year
future = p_model1.make_future_dataframe(periods=52,freq = 'W')
# Make predictions
forecast1 = p_model1.predict(future)
# Plot the forecast
# plot_plotly(p_model1, forecast1)
# Extracting the future sales for 1 year (.copy() avoids SettingWithCopyWarning)
forecast_future = forecast1[forecast1['ds'] > '2023-07-01'].copy()
# Printing to ensure if the filter is working right
forecast_future.head(2)
| ds | trend | yhat_lower | yhat_upper | trend_lower | trend_upper | additive_terms | additive_terms_lower | additive_terms_upper | yearly | yearly_lower | yearly_upper | multiplicative_terms | multiplicative_terms_lower | multiplicative_terms_upper | yhat | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 135 | 2023-07-02 | -110.053709 | -156.706731 | 201.795569 | -110.053709 | -110.053709 | 135.114942 | 135.114942 | 135.114942 | 135.114942 | 135.114942 | 135.114942 | 0.0 | 0.0 | 0.0 | 25.061234 |
| 136 | 2023-07-09 | -114.546123 | -197.287970 | 165.726691 | -114.818539 | -114.317650 | 100.385351 | 100.385351 | 100.385351 | 100.385351 | 100.385351 | 100.385351 | 0.0 | 0.0 | 0.0 | -14.160772 |
# Removing any negative sales
forecast_future['yhat'] = forecast_future['yhat'].clip(lower=0)
# Computing yearly sales
yearly_forecast_mulberry = forecast_future.groupby(forecast_future['ds'].dt.year)['yhat'].sum()
# Create a new DataFrame from the grouped data
yearly_forecast_mulberry_df = pd.DataFrame({'Year': yearly_forecast_mulberry.index, 'Unit_Sales': yearly_forecast_mulberry.values})
# Printing the data
yearly_forecast_mulberry_df
| Year | Unit_Sales | |
|---|---|---|
| 0 | 2023 | 25.061234 |
| 1 | 2024 | 0.000000 |
Considering various combinations to determine why sales are dipping for swire + diet + mulberry.
# Extracting the mulberry flavor
mulberry = final_merged_data[(final_merged_data['ITEM'].str.contains('MULBERRIES', case=False, regex=True))]
mulberry_forecast = mulberry.copy()
# Converting to date time variable
mulberry_forecast['DATE'] = pd.to_datetime(mulberry_forecast['DATE'])
# No duplicate date rows
mulberry_forecast_daily = mulberry_forecast.groupby('DATE')['UNIT_SALES'].sum().reset_index()
# Sort according to the date
mulberry_forecast = mulberry_forecast_daily.sort_values(by='DATE')
# Rename for prophet requirements
mulberry_forecast = mulberry_forecast.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
# Printing the last five rows
mulberry_forecast.tail(5)
| ds | y | |
|---|---|---|
| 143 | 2023-09-30 | 4307.0 |
| 144 | 2023-10-07 | 4330.0 |
| 145 | 2023-10-14 | 4467.0 |
| 146 | 2023-10-21 | 4254.0 |
| 147 | 2023-10-28 | 3974.0 |
p_model2 = Prophet()
p_model2.fit(mulberry_forecast)
# Create future dates for forecasting
future = p_model2.make_future_dataframe(periods=52,freq = 'W')
# Make predictions
forecast2 = p_model2.predict(future)
# Plot the forecast
# plot_plotly(p_model2, forecast2)
Mulberry sales overall are better compared to Mulberry + Swire + Diet.
# Extracting the forecasted sales (.copy() avoids SettingWithCopyWarning)
forecast_future = forecast2[forecast2['ds'] > '2023-10-28'].copy()
forecast_future.tail(1)
| ds | trend | yhat_lower | yhat_upper | trend_lower | trend_upper | additive_terms | additive_terms_lower | additive_terms_upper | yearly | yearly_lower | yearly_upper | multiplicative_terms | multiplicative_terms_lower | multiplicative_terms_upper | yhat | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 199 | 2024-10-20 | 3778.492273 | 2305.154067 | 3752.276782 | 3405.329831 | 4128.001289 | -712.177312 | -712.177312 | -712.177312 | -712.177312 | -712.177312 | -712.177312 | 0.0 | 0.0 | 0.0 | 3066.314961 |
The trend may be decreasing, but there are still positive sales in the thousands.
forecast_future['yhat'] = forecast_future['yhat'].clip(lower=0)
yearly_forecast_mulberry = forecast_future.groupby(forecast_future['ds'].dt.year)['yhat'].sum()
# Create a new DataFrame from the grouped data
yearly_forecast_mulberry_df = pd.DataFrame({'Year': yearly_forecast_mulberry.index, 'Unit_Sales': yearly_forecast_mulberry.values})
yearly_forecast_mulberry_df
| Year | Unit_Sales | |
|---|---|---|
| 0 | 2023 | 44463.494491 |
| 1 | 2024 | 173997.204651 |
The forecasted sales for 1 year total 218,460. The mulberry flavor itself is not the cause of the low unit sales in Mulberry + Diet + Swire.
# Extracting mulberry flavor
mulberry = final_merged_data[(final_merged_data['ITEM'].str.contains('MULBERRIES', case=False, regex=True))]
# Filtering for diet
mulberry_diet = mulberry[mulberry['CALORIC_SEGMENT'] == 'DIET/LIGHT']
mulberry_diet_forecast = mulberry_diet.copy()
# Converting to date time
mulberry_diet_forecast['DATE'] = pd.to_datetime(mulberry_diet_forecast['DATE'])
# Ensuring no duplicate records for date
mulberry_diet_forecast_daily = mulberry_diet_forecast.groupby('DATE')['UNIT_SALES'].sum().reset_index()
# Sorting the date
mulberry_diet_forecast_daily = mulberry_diet_forecast_daily.sort_values(by='DATE')
# Renaming for Prophet requirements
mulberry_diet_forecast = mulberry_diet_forecast_daily.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
# Printing the last date
mulberry_diet_forecast.tail(3)
| ds | y | |
|---|---|---|
| 145 | 2023-10-14 | 4467.0 |
| 146 | 2023-10-21 | 4254.0 |
| 147 | 2023-10-28 | 3974.0 |
# Creating an instance for the model
p_model4 = Prophet()
# Fitting the model
p_model4.fit(mulberry_diet_forecast)
# Create future dates for forecasting
future = p_model4.make_future_dataframe(periods=52,freq = 'W')
# Make predictions
forecast3 = p_model4.predict(future)
# Plot the forecast
# plot_plotly(p_model4, forecast3)
Let's extract the forecast beyond the last observed date:
# Extracting the future sales for 1 year (.copy() avoids SettingWithCopyWarning)
forecast_future = forecast3[forecast3['ds'] > '2023-10-28'].copy()
# Ensuring no negative unit sales
forecast_future['yhat'] = forecast_future['yhat'].clip(lower=0)
# Compute yearly sales
yearly_forecast_mulberry_diet = forecast_future.groupby(forecast_future['ds'].dt.year)['yhat'].sum()
# Create a new DataFrame from the grouped data
yearly_forecast_mulberry_diet_df = pd.DataFrame({'Year': yearly_forecast_mulberry_diet.index, 'Unit_Sales': yearly_forecast_mulberry_diet.values})
yearly_forecast_mulberry_diet_df
| Year | Unit_Sales | |
|---|---|---|
| 0 | 2023 | 37951.326633 |
| 1 | 2024 | 178579.675365 |
The forecasted sales for 1 year total 216K. Mulberry (flavor) + Diet (caloric segment) is not the cause of the low unit sales in Mulberry + Diet + Swire. Let's check Swire now:
# Extracting mulberry flavor
mulberry = final_merged_data[(final_merged_data['ITEM'].str.contains('MULBERRIES', case=False, regex=True))]
# Filtering Swire sales
mulberry_swire = mulberry[mulberry['MANUFACTURER'] == 'SWIRE-CC']
mulberry_swire_forecast = mulberry_swire.copy()
# Converting to Date Time
mulberry_swire_forecast['DATE'] = pd.to_datetime(mulberry_swire_forecast['DATE'])
# Aggregating by date
mulberry_swire_forecast_daily = mulberry_swire_forecast.groupby('DATE')['UNIT_SALES'].sum().reset_index()
# Sorting by date
mulberry_swire_forecast = mulberry_swire_forecast_daily.sort_values(by='DATE')
# Renaming the columns for Prophet requirements
mulberry_swire_forecast = mulberry_swire_forecast.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
# Printing the last date
mulberry_swire_forecast.tail(1)
| ds | y | |
|---|---|---|
| 134 | 2023-07-01 | 1.0 |
# Creating an instance for the Prophet model
p_model3 = Prophet()
# Fitting the model
p_model3.fit(mulberry_swire_forecast)
# Create future dates for forecasting
future = p_model3.make_future_dataframe(periods=52,freq = 'W')
# Make predictions
forecast2 = p_model3.predict(future)
# Plot the forecast
# plot_plotly(p_model3, forecast2)
# Future sales (.copy() avoids SettingWithCopyWarning)
forecast_future = forecast2[forecast2['ds'] > '2023-07-01'].copy()
# No negative sales
forecast_future['yhat'] = forecast_future['yhat'].clip(lower=0)
# Year wise sales
yearly_forecast_mulberry_swire = forecast_future.groupby(forecast_future['ds'].dt.year)['yhat'].sum()
# Create a new DataFrame from the grouped data
yearly_forecast_mulberry_df = pd.DataFrame({'Year': yearly_forecast_mulberry_swire.index, 'Unit_Sales': yearly_forecast_mulberry_swire.values})
yearly_forecast_mulberry_df
| Year | Unit_Sales | |
|---|---|---|
| 0 | 2023 | 27.580757 |
| 1 | 2024 | 0.000000 |
mulberry = final_merged_data[(final_merged_data['ITEM'].str.contains('MULBERRIES', case=False, regex=True))]
mulberry_diet = mulberry[mulberry['CALORIC_SEGMENT'] == 'DIET/LIGHT']
mulberry_diet['CATEGORY'].value_counts()
CATEGORY
SPARKLING WATER       8399
ING ENHANCED WATER    3679
SSD                     18
Name: count, dtype: int64
mulberry_diet_sparkling = mulberry_diet[mulberry_diet['CATEGORY'] == 'SPARKLING WATER']
from sklearn.metrics import mean_absolute_error, mean_squared_error
from math import sqrt
import numpy as np
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt
# Assuming mulberry_diet_sparkling_validation is your DataFrame
mulberry_diet_sparkling_validation = mulberry_diet_sparkling.copy()
# Convert Date to datetime variable
mulberry_diet_sparkling_validation['DATE'] = pd.to_datetime(mulberry_diet_sparkling_validation['DATE'])
# Group by date to ensure only one record for each date
mulberry_diet_sparkling_validation = mulberry_diet_sparkling_validation.groupby('DATE')['UNIT_SALES'].sum().reset_index()
# Sort the data by date
mulberry_diet_sparkling_validation = mulberry_diet_sparkling_validation.sort_values(by='DATE')
# Calculate the index for partitioning the data
partition_index = int(len(mulberry_diet_sparkling_validation) * 0.8)
# Split the data into train and test sets
train_data = mulberry_diet_sparkling_validation.iloc[:partition_index]
test_data = mulberry_diet_sparkling_validation.iloc[partition_index:]
# Renaming for Prophet requirements
train_data = train_data.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
test_data = test_data.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
p_model = Prophet()
p_model.fit(train_data)
# Create future dates for forecasting
future = p_model.make_future_dataframe(periods=34, freq='W')
# Make predictions
forecast = p_model.predict(future)
# Extracting the date and unit sales from forecast
pred = forecast[['ds','yhat']]
# Plotting actual vs forecasted sales
plt.plot(train_data['ds'], train_data['y'], label='Actual Train Set', color='blue')
plt.plot(test_data['ds'], test_data['y'], label='Actual Test Set', color='green')
plt.plot(pred['ds'], pred['yhat'], label='Forecast', color='red')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.title('Actual vs Forecast')
plt.legend()
plt.xticks(rotation=45)
plt.axhline(0, color='black', linewidth=0.5)
plt.show()
22:37:56 - cmdstanpy - INFO - Chain [1] start processing 22:37:56 - cmdstanpy - INFO - Chain [1] done processing
Model appears to be doing a fine job. Checking error metrics:
from sklearn.metrics import mean_absolute_error, mean_squared_error
from math import sqrt
import numpy as np
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt
# Assuming mulberry_diet_sparkling_validation is your DataFrame
mulberry_diet_sparkling_validation = mulberry_diet_sparkling.copy()
# Convert Date to datetime variable
mulberry_diet_sparkling_validation['DATE'] = pd.to_datetime(mulberry_diet_sparkling_validation['DATE'])
# Group by date to ensure only one record for each date
mulberry_diet_sparkling_validation = mulberry_diet_sparkling_validation.groupby('DATE')['UNIT_SALES'].sum().reset_index()
# Sort the data by date
mulberry_diet_sparkling_validation = mulberry_diet_sparkling_validation.sort_values(by='DATE')
# Calculate the index for partitioning the data
partition_index = int(len(mulberry_diet_sparkling_validation) * 0.8)
# Split the data into train and test sets
train_data = mulberry_diet_sparkling_validation.iloc[:partition_index]
test_data = mulberry_diet_sparkling_validation.iloc[partition_index:]
# Renaming for Prophet requirements
train_data = train_data.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
test_data = test_data.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
p_model = Prophet()
p_model.fit(train_data)
# Create future dates for forecasting
future = p_model.make_future_dataframe(periods=0, freq='W')
# Make predictions
forecast = p_model.predict(future)
# Extracting the date and unit sales from forecast
pred = forecast[['ds','yhat']]
get_scores(train_data['y'], pred['yhat'])
23:14:15 - cmdstanpy - INFO - Chain [1] start processing 23:14:15 - cmdstanpy - INFO - Chain [1] done processing
mae                  226.692952
mse                94751.882440
rmse                 307.817937
mape                   3.956043
direct_accuracy       96.043957
dtype: float64
forecast = p_model.predict(test_data)
pred = forecast[['ds','yhat']]
get_scores(test_data['y'], pred['yhat'])
mae                  654.519767
mse               618493.972874
rmse                 786.443878
mape                  13.653096
direct_accuracy       86.346904
dtype: float64
The MAPE is about 13%, so the model is performing well with no sign of overfitting. Next, we train on the entire dataset and forecast the next one year.
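The single 80/20 hold-out above could also be extended to a rolling-origin evaluation, refitting on an expanding window and scoring each subsequent 12-week block. A minimal sketch of the split generation (the sizes are hypothetical, chosen to match a 148-week series):

```python
def expanding_splits(n, initial=100, horizon=12, step=12):
    """Yield (train_end, test_end) row indices for rolling-origin evaluation."""
    train_end = initial
    while train_end + horizon <= n:
        yield train_end, train_end + horizon
        train_end += step

# For a 148-week series: four expanding train sets, each scored on the next 12 weeks
splits = list(expanding_splits(148))
print(splits)
```

Each pair would slice the sorted frame with `.iloc[:train_end]` for training and `.iloc[train_end:test_end]` for scoring.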
mulberry_diet_sparkling['DATE'] = pd.to_datetime(mulberry_diet_sparkling['DATE'])
# Ensuring no duplicate records for date
mulberry_diet_sparkling_daily = mulberry_diet_sparkling.groupby('DATE')['UNIT_SALES'].sum().reset_index()
# Sorting the date
mulberry_diet_sparkling_daily = mulberry_diet_sparkling_daily.sort_values(by='DATE')
# Renaming for Prophet requirements
mulberry_diet_sparkling_daily = mulberry_diet_sparkling_daily.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
# Printing the last date
mulberry_diet_sparkling_daily.tail(1)
| ds | y | |
|---|---|---|
| 147 | 2023-10-28 | 3896.0 |
# Creating an instance for the model
p_model5 = Prophet()
# Fitting the model
p_model5.fit(mulberry_diet_sparkling_daily)
# Create future dates for forecasting
future5 = p_model5.make_future_dataframe(periods=52,freq = 'W')
# Make predictions
forecast5 = p_model5.predict(future5)
# Plot the forecast
plot_plotly(p_model5, forecast5)
# Future sales (.copy() avoids SettingWithCopyWarning)
forecast5_future = forecast5[forecast5['ds'] > '2023-10-28'].copy()
forecast5_future
# No negative sales
forecast5_future['yhat'] = forecast5_future['yhat'].clip(lower=0)
# Year-wise sales
yearly_forecast_mulberry_diet_sparkling = forecast5_future.groupby(forecast5_future['ds'].dt.year)['yhat'].sum()
# Create a new DataFrame from the grouped data
yearly_forecast_mulberry_sparking_diet_df = pd.DataFrame({'Year': yearly_forecast_mulberry_diet_sparkling.index, 'Unit_Sales': yearly_forecast_mulberry_diet_sparkling.values})
yearly_forecast_mulberry_sparking_diet_df
| | Year | Unit_Sales |
|---|---|---|
| 0 | 2023 | 38484.278250 |
| 1 | 2024 | 184932.248948 |
From the table above, the last two months of 2023 plus the remaining months of 2024 give a one-year forecasted sale of roughly 223K units for the diet + mulberry + sparkling water segment.
Accounting for Swire:
mulberry = final_merged_data[(final_merged_data['ITEM'].str.contains('MULBERRIES', case=False, regex=True))]
mulberry_diet = mulberry[mulberry['CALORIC_SEGMENT'] == 'DIET/LIGHT']
mulberry_diet_sparkling = mulberry_diet[mulberry_diet['CATEGORY'] == 'SPARKLING WATER']
mulberry_diet_sparkling['MANUFACTURER'].value_counts(normalize=True)
MANUFACTURER
JOLLYS      0.571735
COCOS       0.151328
BEARS       0.141207
SWIRE-CC    0.135730
Name: proportion, dtype: float64
Swire accounts for roughly 13.6% of unit sales in the filtered data compared to other brands, so we attribute only that share of the obtained forecast to Swire. About 13% of 223K comes to roughly 29K unit sales.
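As a minimal sketch of this share-scaling step (the yearly forecast totals and the SWIRE-CC proportion are copied from the outputs above; note the prose rounds the share down to 13%, which gives ~29K rather than the ~30K the unrounded share yields):

```python
import pandas as pd

# Yearly category forecast from the Prophet model above
yearly_forecast = pd.DataFrame({"Year": [2023, 2024],
                                "Unit_Sales": [38484.28, 184932.25]})

# SWIRE-CC's proportion of unit sales in the filtered data
swire_share = 0.135730

# Scale the one-year category forecast down to Swire's expected share
total_one_year = yearly_forecast["Unit_Sales"].sum()   # ~223K units
swire_forecast = total_one_year * swire_share          # ~30K units
```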
Item Description: Diet Energy Moonlit Casava 2L Multi Jug
Swire plans to release this product for 6 months. What will the forecasted weekly demand be for this product?
Filtering the dataset:
multi_jug_2l = market_demand[market_demand['PACKAGE']=='2L MULTI JUG']
diet_multi_jug_2l = multi_jug_2l[multi_jug_2l['CALORIC_SEGMENT']=='DIET/LIGHT']
swire_diet_multi_jug_2l = diet_multi_jug_2l[diet_multi_jug_2l['MANUFACTURER']=='SWIRE-CC']
swire_diet_multi_jug_2l_moonlit = swire_diet_multi_jug_2l[swire_diet_multi_jug_2l['BRAND'].str.contains('DIET MOONLIT', case=False, regex=True)]
df_6 = swire_diet_multi_jug_2l_moonlit[['DATE','UNIT_SALES']].groupby(by="DATE", as_index=False).sum()
df_6['DATE'] = pd.to_datetime(df_6['DATE'])
df_6.set_index('DATE', inplace=True)
df_6 = df_6.asfreq('W-SAT')
plt.plot(df_6)
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
Text(0, 0.5, 'UNIT_SALES')
There are NaN values for some weeks; digging deeper, we see there were no sales in August 2023.
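The way `asfreq` surfaces such gaps can be illustrated on a toy weekly series (dates and values hypothetical): missing Saturdays are inserted as NaN rows, which could alternatively be interpolated rather than truncated.

```python
import pandas as pd

# Toy weekly sales with one week absent (hypothetical values)
s = pd.Series([100.0, 120.0, 90.0],
              index=pd.to_datetime(["2023-07-01", "2023-07-08", "2023-07-22"]))

# asfreq re-indexes to every Saturday, exposing the gap as NaN
weekly = s.asfreq("W-SAT")     # 2023-07-15 becomes NaN

# One alternative to truncating: linear interpolation across the gap
filled = weekly.interpolate()  # 2023-07-15 -> midpoint of its neighbours
```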
# Train and Test periods
from datetime import datetime
s_date = '2023-02-25'
e_date = '2023-08-26'
df_6 = df_6.loc[df_6.index <= e_date]
train_period = df_6.loc[df_6.index < s_date]
test_period = df_6.loc[(df_6.index >= s_date) & (df_6.index <= e_date)]
start_forecast = datetime.strptime(s_date, '%Y-%m-%d').date()
end_forecast = datetime.strptime(e_date, '%Y-%m-%d').date()
We will create and reuse this function to measure each model's performance on the data. It reports MAE, MSE, RMSE, MAPE, and a direct accuracy percentage.
from sklearn.metrics import mean_absolute_error, mean_squared_error
from math import sqrt
import numpy as np
def get_scores(actual, predicted):
# Calculate errors
mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
rmse = sqrt(mse)
percentage_diff = np.abs((actual - predicted) / actual) * 100
# Calculate MAPE
mape = percentage_diff.mean()
# Calculate "Accuracy" Percentage
accuracy_percentage = 100 - mape
# Print metrics
print(f"MAE: {mae}")
print(f"MSE: {mse}")
print(f"RMSE: {rmse}")
print(f"MAPE: {mape}%")
print(f"Direct 'Accuracy' Percentage: {accuracy_percentage}%")
return pd.Series(data={'mae':mae, 'mse':mse, 'rmse':rmse, 'mape':mape, 'direct_accuracy':accuracy_percentage}, index=['mae', 'mse', 'rmse','mape','direct_accuracy'])
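As a quick hand-check of the metric definitions used in `get_scores` (toy values, NumPy only):

```python
import numpy as np

actual = np.array([100.0, 200.0, 400.0])
predicted = np.array([110.0, 190.0, 380.0])

mae = np.abs(actual - predicted).mean()                      # (10+10+20)/3
rmse = np.sqrt(((actual - predicted) ** 2).mean())           # sqrt(200)
mape = (np.abs((actual - predicted) / actual) * 100).mean()  # (10+5+5)/3 %
accuracy = 100 - mape
```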
ARIMA is a statistical model used for time series analysis to forecast future data points by leveraging past data. It combines three main aspects: autoregression (AR), differencing (I) to make the time series stationary, and moving average (MA). The AR part exploits the relationship between an observation and a number of lagged observations, the I part involves differencing the data to achieve stationarity, and the MA part models the error of the observation as a combination of previous error terms.
from statsmodels.tsa.arima.model import ARIMA
# ARIMA Model fitting
order = (1, 1, 1)  # (p, d, q)
model = ARIMA(train_period['UNIT_SALES'], order=order)
model_fit = model.fit()
# Forecasting
forecast = model_fit.predict(start=start_forecast,
end=end_forecast)
arima_forecast = forecast
arima_score = get_scores(test_period.squeeze(), arima_forecast)
MAE: 1331.9101181487447
MSE: 2785766.362578469
RMSE: 1669.0615215079608
MAPE: 7.579412439517313%
Direct 'Accuracy' Percentage: 92.42058756048269%
# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(test_period['UNIT_SALES'], label='Actual Sales')
plt.plot(arima_forecast, label='Forecast', color='red')
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
plt.legend()
plt.show()
SARIMA extends ARIMA by explicitly accommodating and modeling seasonal effects in time series data. It includes additional seasonal elements on top of the AR, I, and MA components. SARIMA is characterized by its ability to model both non-seasonal and seasonal components of the time series data, making it more versatile than ARIMA for data with clear seasonal patterns, such as sales data around specific holidays or events. It incorporates additional parameters to handle seasonality, which are seasonal AR, seasonal differencing, and seasonal MA components, allowing it to capture seasonal fluctuations effectively, making it ideal for products with seasonal demand.
from statsmodels.tsa.statespace.sarimax import SARIMAX
# SARIMA Model fitting
order = (1, 1, 1) # (p, d, q)
seasonal_order = (1, 1, 1, 52) # (P, D, Q, s)
model = SARIMAX(train_period['UNIT_SALES'], order=order, seasonal_order=seasonal_order)
model_fit = model.fit()
# Forecasting
sarima_forecast = model_fit.predict(start=start_forecast.strftime('%Y-%m-%d'),
                                    end=end_forecast.strftime('%Y-%m-%d'))
sarima_score = get_scores(test_period.squeeze(), sarima_forecast)
MAE: 2078.657764078491
MSE: 5833349.344575272
RMSE: 2415.2327723379526
MAPE: 12.472789935052829%
Direct 'Accuracy' Percentage: 87.52721006494717%
# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(test_period['UNIT_SALES'], label='Actual Sales')
plt.plot(sarima_forecast, label='Forecast', color='red')
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
plt.legend()
plt.show()
The forecast tracks the actual sales closely.
Prophet is a forecasting tool designed by Facebook for handling time series data that displays patterns on different time scales such as yearly, weekly, and daily. It is especially useful for data with strong seasonal effects and several seasons of historical data. Prophet works by fitting nonlinear trends with yearly, weekly, and daily seasonality, plus holiday effects. It is robust to missing data and shifts in the trend, and typically requires no manual tuning of parameters. The model accommodates seasonality through Fourier series and includes components for holidays and special events, making it well-suited for predicting demand for products around specific events or holidays, like Easter.
from prophet import Prophet
df_mod = train_period.reset_index()
df_mod['DATE'] = pd.to_datetime(df_mod['DATE'])
df_mod.columns = ['ds', 'y']
model = Prophet(changepoint_prior_scale=.1,seasonality_prior_scale=.1, weekly_seasonality=True)
model.fit(df_mod)
# Forecast 53 weekly periods (~1 year), to be filtered to the test window
future = model.make_future_dataframe(periods=53, freq='W-SAT')
forecast = model.predict(future)
prophet_forecast = forecast[['ds','yhat']].copy()
prophet_forecast['ds'] = pd.to_datetime(prophet_forecast['ds'])
prophet_forecast.columns = ['DATE', 'yhat']
prophet_forecast.set_index('DATE', inplace=True)
prophet_forecast = prophet_forecast.asfreq('W-SAT')
# Restrict the forecast to the test window
prophet_forecast = prophet_forecast.loc[(prophet_forecast.index >= s_date) & (prophet_forecast.index <= e_date)]
prophet_score = get_scores(test_period.squeeze(), prophet_forecast.squeeze())
MAE: 2573.16930571512
MSE: 9358981.759745052
RMSE: 3059.2452925100747
MAPE: 15.66648137188985%
Direct 'Accuracy' Percentage: 84.33351862811016%
Plotting the actual vs forecast graph:
# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(test_period['UNIT_SALES'], label='Actual Sales')
plt.plot(prophet_forecast, label='Forecast', color='red')
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
plt.legend()
plt.show()
Exponential Smoothing is a time series forecasting method for univariate data that applies smoothing factors to the observations, giving more weight to recent observations while not discarding older observations entirely. It encompasses simple exponential smoothing for data with no clear trend or seasonality, and extends to Holt’s linear trend model and Holt-Winters’ seasonal model, which can account for both trends and seasonality in the data. This method is straightforward and computationally efficient, making it a good choice for producing quick forecasts in situations where data patterns are reasonably consistent over time, but may struggle with data that has complex patterns or significant irregularities.
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# ExponentialSmoothing Model fitting
model = ExponentialSmoothing(train_period['UNIT_SALES'], trend='add', seasonal='add', seasonal_periods=52).fit(smoothing_level=0.01, smoothing_trend=0.01, smoothing_seasonal=0.1)
# Forecasting
forecast_periods = ((end_forecast - start_forecast).days // 7) + 1
exponential_forecast = model.forecast(forecast_periods)
forecast_dates = pd.date_range(start=start_forecast, periods=forecast_periods, freq='W-SAT')
exponential_smoothing_score = get_scores(test_period.squeeze(), exponential_forecast)
MAE: 1672.5828724262556
MSE: 4512195.604211318
RMSE: 2124.192930082227
MAPE: 9.544250972143143%
Direct 'Accuracy' Percentage: 90.45574902785685%
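A quick check of the horizon arithmetic above: with the train/test boundary dates used in this section, the inclusive number of weekly (Saturday) periods works out to 27.

```python
from datetime import date

s_date = date(2023, 2, 25)   # first forecast Saturday
e_date = date(2023, 8, 26)   # last forecast Saturday

# Whole weeks between the endpoints, plus one to include both Saturdays
forecast_periods = ((e_date - s_date).days // 7) + 1
```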
# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(test_period['UNIT_SALES'], label='Actual Sales')
plt.plot(exponential_forecast, label='Forecast', color='red')
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
plt.legend()
plt.show()
pd.options.display.float_format = '{:.2f}'.format
q2_scores = pd.DataFrame({'Arima':arima_score, 'Sarima':sarima_score, 'Prophet':prophet_score, 'Exponential Smoothing':exponential_smoothing_score}).T
print(q2_scores)
                           mae         mse     rmse   mape  direct_accuracy
Arima                  1331.91  2785766.36  1669.06   7.58            92.42
Sarima                 2078.66  5833349.34  2415.23  12.47            87.53
Prophet                2573.17  9358981.76  3059.25  15.67            84.33
Exponential Smoothing  1672.58  4512195.60  2124.19   9.54            90.46
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(18, 10))
fig.suptitle('Model Performance Metrics')
for i, column in enumerate(q2_scores.columns):
row, col = divmod(i, 3)
sns.barplot(ax=axes[row, col], x=q2_scores.index, y=q2_scores[column])
axes[row, col].set_title(column)
axes[row, col].set_ylabel('')
# Adjust layout
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
Based on these metrics, Exponential Smoothing is the most suitable choice. ARIMA posts the lowest MAE, MSE, and RMSE, but Exponential Smoothing comes close (MAE 1672.58, MSE 4512195.60, RMSE 2124.19) while producing a forecast whose shape closely resembles ARIMA's. Its MAPE of 9.54% and direct accuracy of 90.46% indicate consistently low percentage error. Balancing forecast accuracy against the stability of its forecast profile, Exponential Smoothing emerges as the model to carry forward.
# grouping by date
df_6 = swire_diet_multi_jug_2l_moonlit[['DATE','UNIT_SALES']].groupby(by="DATE", as_index=False).sum()
df_6['DATE'] = pd.to_datetime(df_6['DATE'])
df_6.set_index('DATE', inplace=True)
df_6 = df_6.loc[df_6.index <= e_date]
df_6 = df_6.asfreq('W-SAT')
start_forecast = datetime.strptime("2023-09-02", '%Y-%m-%d').date()
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# ExponentialSmoothing Model fitting
model = ExponentialSmoothing(df_6['UNIT_SALES'], trend='add', seasonal='add', seasonal_periods=52).fit(smoothing_level=0.01, smoothing_trend=0.01, smoothing_seasonal=0.1)
# Forecasting
forecast_periods = 26
exponential_forecast = model.forecast(forecast_periods)
forecast_dates = pd.date_range(start=start_forecast, periods=forecast_periods, freq='W-SAT')
# Displaying the 6-month forecast
exponential_forecast
2023-09-02    16170.76
2023-09-09    15186.27
2023-09-16    15677.02
2023-09-23    16999.64
2023-09-30    16727.67
2023-10-07    15485.82
2023-10-14    15160.90
2023-10-21    16290.53
2023-10-28    15278.34
2023-11-04    15843.64
2023-11-11    15831.58
2023-11-18    15807.30
2023-11-25    17151.41
2023-12-02    16644.21
2023-12-09    17270.94
2023-12-16    16978.95
2023-12-23    18534.64
2023-12-30    17596.79
2024-01-06    17048.72
2024-01-13    15438.83
2024-01-20    15804.36
2024-01-27    15687.73
2024-02-03    16612.23
2024-02-10    16418.08
2024-02-17    15646.95
2024-02-24    16181.15
Freq: W-SAT, dtype: float64
Visualizing the forecast in a plot:
# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(exponential_forecast, label='Forecast', color='red')
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
plt.legend()
plt.show()
The forecast for Swire's upcoming product, Diet Energy Moonlit Cassava 2L Multi Jug, was meticulously developed using a dataset tailored to match the product's characteristics: Diet caloric segment, Energy market category, manufactured by Swire-CC, and focusing on the 2L Multi Jug package type. The data was further refined to concentrate on historical sales data that aligns closely with the anticipated product profile, particularly emphasizing the Cassava flavor under the Diet Moonlit brand.
Four forecasting methodologies were evaluated for their effectiveness in predicting the product's demand over its 6-month release window: ARIMA, SARIMA, Prophet, and Exponential Smoothing. The models were assessed using several key performance indicators: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), and Direct Accuracy Percentage.
Exponential Smoothing was identified as the most apt model for this forecasting task. While it did not achieve the lowest MAE and MSE in absolute terms, its performance was close to ARIMA's, with an MAE of 1672.58, MSE of 4512195.60, and RMSE of 2124.19. Its MAPE of 9.54% and Direct Accuracy Percentage of 90.46% underscored its robustness and consistency in prediction accuracy. The SARIMA and Prophet models offered useful insights but did not match Exponential Smoothing's overall efficacy, and ARIMA, despite posting the lowest raw error metrics, was set aside in favor of Exponential Smoothing's smoother, more stable forecast profile.
The chosen Exponential Smoothing model then provided a forecast for the new Cassava-flavored beverage, projecting weekly sales for the next 6 months. This projection anticipates the product's market performance, guiding Swire in strategic planning for production, distribution, and promotional efforts. The model forecasts an initial weekly sales figure of 16,170.76 units, with variations over the 6-month period, peaking at certain points due to seasonal adjustments and market responses.
The Exponential Smoothing model stands out for its accuracy and reliability in forecasting sales for Swire's new Diet Energy Moonlit Cassava 2L Multi Jug. It offers a nuanced understanding of expected demand, capturing both the general trend and the seasonal fluctuations. These insights are crucial for Swire's preparation and strategy formulation, ensuring that production levels are optimally aligned with market demand, and marketing initiatives are effectively timed to maximize sales potential over the product's introductory phase.
Swire plans to release this product for 13 weeks, but only in one region. Which region would it perform best in?
Look at region wise distribution for two parameter - manufacturer & brand:
With the three filters above (Manufacturer + Brand + Category) applied, the beverages were sold in about 5 package types, and over 90% of the sales came from the "18SMALL MULTI JUG" type.
Since the ".5L 12One Jug" package planned for this innovative product was rarely used by Greetingle and Swire, it has little bearing on identifying the best region. Applying both technical and domain judgment, we drop package type from further analysis.
On applying the filters flavor 'Woodsy Yellow' + Regular + ING ENHANCED WATER, we see that only the COCOS manufacturer has experimented with this combination. Let's gauge the region-wise sales:
Let's compare sales by region for Woodsy Yellow: overall sales vs. ING Enhanced Water vs. Regular:
From all the analysis up to this point, the Central, North West, and South regions have consistently topped the charts, whether by Brand, Manufacturer, Caloric Segment, Flavor, or Category, and even for combinations of these.
Let's dive deeper into these 3 regions over time. Since we don't need to forecast or look at absolute sales figures, the ideal chart here is a ribbon chart rather than a regular line chart:
For Flavor alone: Central with little competition from South
Flavor + Regular: Central clear winner
Flavor + Regular + ENH Water: NW clear winner
Now for brand, brand + ENH Water:
For Brand alone : Central winner for the most part
Brand + ENH Water: Central again
From the previous two results, we can safely eliminate the South region: it has a significant presence for flavor alone, but once the other parameters are layered on top, its sales fall off rapidly compared to the NW and Central regions.
Top Insights:
consumer_demographics_zip_region = consumer_demographics[['Zip', 'State','REGION']]
consumer_demographics_region = consumer_demographics_zip_region[(consumer_demographics_zip_region['REGION'] == 'Central') | (consumer_demographics_zip_region['REGION'] == 'North West') ]
consumer_demographics_region['REGION'].value_counts()
REGION
North West    990
Central       800
Name: count, dtype: int64
zip_cons = pd.merge(consumer_demographics_region, zip_to_market, left_on='Zip', right_on='ZIP_CODE', how='inner')
zip_cons = zip_cons.drop_duplicates(subset=['MARKET_KEY', 'REGION'])
merged_data = pd.merge(market_demand, zip_cons, on='MARKET_KEY', how='inner')
merged_data = merged_data.drop(['Zip','ZIP_CODE'], axis=1)
merged_data['State'].value_counts(normalize=True)
State
CO    0.381271
WA    0.263675
OR    0.228770
UT    0.126284
Name: proportion, dtype: float64
# Group by 'State' and calculate summary statistics for 'UNIT_SALES'
unit_sales_summary_by_state = merged_data.groupby('State')['UNIT_SALES'].describe()
unit_sales_summary_by_state
| State | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| CO | 4409784.0 | 158.650056 | 426.420239 | 0.04 | 11.0 | 43.0 | 133.0 | 11219.0 |
| OR | 2645953.0 | 156.398806 | 356.790827 | 0.04 | 13.0 | 49.0 | 144.0 | 11100.0 |
| UT | 1460597.0 | 196.563396 | 471.057990 | 0.04 | 13.0 | 51.0 | 167.0 | 14773.0 |
| WA | 3049667.0 | 145.675230 | 331.236478 | 0.04 | 12.0 | 48.0 | 138.0 | 9243.0 |
While Utah has the fewest sales records, it has the highest average units per order. The max values suggest a strong possibility of outliers or very large orders.
Note: since we were told the dataset is already cleaned, we will not analyze those records further or remove them.
print("unit sales:")
region_sales_stats = merged_data.groupby('REGION')['UNIT_SALES'].describe()
print(region_sales_stats)
# Group the data by 'REGION' and compute summary statistics for 'DOLLAR_SALES'
region_sales_summary = merged_data.groupby('REGION')['DOLLAR_SALES'].describe()
print("dollar sales:")
# Display the summary statistics
print(region_sales_summary)
unit sales:
count mean std min 25% 50% 75% max
REGION
Central 5870381.0 168.083194 438.258413 0.04 12.0 45.0 140.0 14773.0
North West 5695620.0 150.656967 343.386241 0.04 13.0 49.0 141.0 11100.0
dollar sales:
count mean std min 25% 50% 75% max
REGION
Central 5870381.0 575.904097 1464.522706 0.01 40.16 156.31 500.48 69472.26
North West 5695620.0 487.903103 1122.230124 0.01 44.68 159.52 458.65 46137.39
There is no significant difference in unit sales; however, the Central region has higher dollar sales than NW in the overall dataset.
merged_data['DATE'] = pd.to_datetime(merged_data['DATE'])
earliest_date = merged_data['DATE'].min()
# Step 2: Calculate week numbers
merged_data['WEEK_NUMBER'] = ((merged_data['DATE'] - earliest_date).dt.days // 7) % 52
# Filter data for Central region
central = merged_data[merged_data['REGION'] == 'Central']
# Filter data for Northwest region
northwest = merged_data[merged_data['REGION'] == 'North West']
The plan here mirrors the two batches of filters applied for the visuals above for this question: we filter the data accordingly, then build a Prophet model for each region to compare the best 13 weeks of forecasted sales region-wise. This gives statistical support for choosing the best region.
# Step 3: Aggregate the data for every week
northwest_fil_1 = northwest[(northwest['BRAND']=='GREETINGLE') & (northwest['CATEGORY']=='ING ENHANCED WATER')]
northwest_fil_1 = northwest_fil_1.groupby('DATE')['UNIT_SALES'].sum().reset_index()
northwest_fil_1.tail(1)
| | DATE | UNIT_SALES |
|---|---|---|
| 147 | 2023-10-28 | 19875.67 |
The last date is 2023-10-28
from prophet import Prophet
from prophet.plot import plot_plotly, plot_components_plotly
# Initialize and fit the Prophet model
p_model = Prophet()
nw_model_data1 = northwest_fil_1.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
p_model.fit(nw_model_data1)
# Create future dates for forecasting
future = p_model.make_future_dataframe(periods=52, freq='W')
# Make predictions
forecast = p_model.predict(future)
# Plot the forecast
plot_plotly(p_model, forecast)
The trend is negative. Filtering the forecast to the last one year:
# week wise sales for best 13 weeks
# Week-wise sales for the best 13 weeks
nw_fil1_forecast_1year = forecast[forecast['ds'] > '2023-10-28'].copy()
earliest_date = nw_fil1_forecast_1year['ds'].min()
nw_fil1_forecast_1year
# Calculate week numbers
nw_fil1_forecast_1year['WEEK_NUMBER'] = ((nw_fil1_forecast_1year['ds'] - earliest_date).dt.days // 7) % 52
weekly_forecast_data = nw_fil1_forecast_1year.groupby('WEEK_NUMBER')['yhat'].sum().reset_index()
Taking rolling sum 13 week aggregate wise:
# Ensure that there are enough rows to compute the rolling sum
if len(weekly_forecast_data) < 13:
print("Insufficient data to compute rolling sum.")
else:
# Initialize an empty list to store the results
rolling_sales_sum = []
# Iterate over the DataFrame to calculate the rolling sum for each 13-week interval
for i in range(len(weekly_forecast_data) - 12):
# Calculate the sum of UNIT_SALES for the current 13-week interval
sales_sum = weekly_forecast_data.loc[i:i+12, 'yhat'].sum()
# Construct the 13-week interval string
interval = f"{weekly_forecast_data.iloc[i]['WEEK_NUMBER']}-{weekly_forecast_data.iloc[i+12]['WEEK_NUMBER']}"
# Append the result to the list
rolling_sales_sum.append((sales_sum, interval))
# Create a new DataFrame from the results
rolling_sales_df = pd.DataFrame(rolling_sales_sum, columns=['SALES_13W_SUM', '13_WEEK_INTERVAL'])
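For reference, the manual loop above can be expressed more idiomatically with pandas' built-in rolling window; a sketch on synthetic weekly data (values hypothetical):

```python
import pandas as pd

# 52 hypothetical weekly forecasts
weekly = pd.DataFrame({"WEEK_NUMBER": range(52),
                       "yhat": [100.0 + (i % 13) * 10 for i in range(52)]})

# Sum over each 13-week span; label each span by its starting week
sums = weekly["yhat"].rolling(window=13).sum().dropna().reset_index(drop=True)
rolling_sales_df = pd.DataFrame({
    "SALES_13W_SUM": sums,
    "13_WEEK_INTERVAL": [f"{i}-{i + 12}" for i in range(len(sums))],
})

# The best launch window is the interval with the largest rolling sum
best = rolling_sales_df.loc[rolling_sales_df["SALES_13W_SUM"].idxmax()]
```

The rolling approach yields the same 13-week sums as the explicit loop while staying vectorized.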
Plotting:
import matplotlib.pyplot as plt
# Assuming rolling_sales_df contains the DataFrame with 'SALES_13W_SUM' and '13_WEEK_INTERVAL' columns
# Extract data for plotting
sales_sum = rolling_sales_df['SALES_13W_SUM']
interval = rolling_sales_df['13_WEEK_INTERVAL']
# Plot the rolling sum
plt.figure(figsize=(18, 6))
plt.plot(interval, sales_sum, marker='o', linestyle='-')
plt.xlabel('13-Week Interval')
plt.ylabel('Sales Sum')
plt.title('Rolling Sum of Sales over 13-Week Intervals')
plt.xticks(rotation=45) # Rotate x-axis labels for better readability
plt.grid(True)
# Find the index of the maximum value
max_index = sales_sum.idxmax()
max_value = sales_sum[max_index]
max_interval = interval[max_index]
# Annotate the peak with a red marker and the corresponding 13-week interval
plt.annotate(f'Max: {max_value} ({max_interval})',
xy=(max_interval, max_value),
xytext=(70, -20),
textcoords='offset points',
arrowprops=dict(facecolor='red', arrowstyle='->'),
fontsize=10,
color='red',
ha='center')
plt.tight_layout()
plt.show()
The best 13-week window obtained is weeks 29-41, with a sales figure of about 300K.
total_fil1_nw = weekly_forecast_data['yhat'].sum()
total_13_week_sum = rolling_sales_df[rolling_sales_df['13_WEEK_INTERVAL']== '29.0-41.0'].sum()['SALES_13W_SUM']
percentage = total_13_week_sum*100/total_fil1_nw
print(round(percentage,2))
30.62
The best 13 weeks account for 30.62% of the one-year sales. Let's do the same for the Central region:
central_fil_1 = central[(central['BRAND']=='GREETINGLE') & (central['CATEGORY']=='ING ENHANCED WATER')]
central_fil_1 = central_fil_1.groupby('DATE')['UNIT_SALES'].sum().reset_index()
central_fil_1.tail(1)
| | DATE | UNIT_SALES |
|---|---|---|
| 147 | 2023-10-28 | 21070.0 |
The last date is 2023-10-28
from prophet import Prophet
from prophet.plot import plot_plotly, plot_components_plotly
# Initialize and fit the Prophet model
p_model_1 = Prophet()
cw_model_data1 = central_fil_1.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
p_model_1.fit(cw_model_data1)
# Create future dates for forecasting
future_1 = p_model_1.make_future_dataframe(periods=52, freq='W')
# Make predictions
forecast_1 = p_model_1.predict(future_1)
# Plot the forecast
plot_plotly(p_model_1, forecast_1)
# Week-wise sales for the best 13 weeks
cw_fil1_forecast_1year = forecast_1[forecast_1['ds'] > '2023-10-28'].copy()
earliest_date_1 = cw_fil1_forecast_1year['ds'].min()
# Calculate week numbers
cw_fil1_forecast_1year['WEEK_NUMBER'] = ((cw_fil1_forecast_1year['ds'] - earliest_date_1).dt.days // 7) % 52
weekly_forecast_data_1 = cw_fil1_forecast_1year.groupby('WEEK_NUMBER')['yhat'].sum().reset_index()
# Ensure that there are enough rows to compute the rolling sum
if len(weekly_forecast_data_1) < 13:
print("Insufficient data to compute rolling sum.")
else:
# Initialize an empty list to store the results
rolling_sales_sum = []
# Iterate over the DataFrame to calculate the rolling sum for each 13-week interval
for i in range(len(weekly_forecast_data_1) - 12):
# Calculate the sum of UNIT_SALES for the current 13-week interval
sales_sum = weekly_forecast_data_1.loc[i:i+12, 'yhat'].sum()
# Construct the 13-week interval string
interval = f"{weekly_forecast_data_1.iloc[i]['WEEK_NUMBER']}-{weekly_forecast_data_1.iloc[i+12]['WEEK_NUMBER']}"
# Append the result to the list
rolling_sales_sum.append((sales_sum, interval))
# Create a new DataFrame from the results
rolling_sales_df = pd.DataFrame(rolling_sales_sum, columns=['SALES_13W_SUM', '13_WEEK_INTERVAL'])
import matplotlib.pyplot as plt
# Assuming rolling_sales_df contains the DataFrame with 'SALES_13W_SUM' and '13_WEEK_INTERVAL' columns
# Extract data for plotting
sales_sum = rolling_sales_df['SALES_13W_SUM']
interval = rolling_sales_df['13_WEEK_INTERVAL']
# Plot the rolling sum
plt.figure(figsize=(18, 6))
plt.plot(interval, sales_sum, marker='o', linestyle='-')
plt.xlabel('13-Week Interval')
plt.ylabel('Sales Sum')
plt.title('Rolling Sum of Sales over 13-Week Intervals')
plt.xticks(rotation=45) # Rotate x-axis labels for better readability
plt.grid(True)
# Find the index of the maximum value
max_index = sales_sum.idxmax()
max_value = sales_sum[max_index]
max_interval = interval[max_index]
# Annotate the peak with a red marker and the corresponding 13-week interval
plt.annotate(f'Max: {max_value} ({max_interval})',
xy=(max_interval, max_value),
xytext=(70, -20),
textcoords='offset points',
arrowprops=dict(facecolor='red', arrowstyle='->'),
fontsize=10,
color='red',
ha='center')
plt.tight_layout()
plt.show()
total_fil1_central = weekly_forecast_data_1['yhat'].sum()
total_13_week_sum_central = rolling_sales_df[rolling_sales_df['13_WEEK_INTERVAL']== '25.0-37.0'].sum()['SALES_13W_SUM']
percentage = total_13_week_sum_central*100/total_fil1_central
print(round(percentage,2))
31.24
The best 13 weeks account for 31.24% of the one-year sales.
northwest = merged_data[merged_data['REGION'] == 'North West']
woodsy_yellow = northwest[(northwest['ITEM'].str.contains('WOODSY YELLOW ', case=False, regex=True))]
woodsy_yellow_regular = woodsy_yellow[woodsy_yellow['CALORIC_SEGMENT'] == 'REGULAR']
woodsy_yellow_regular_enhanced_water = woodsy_yellow_regular[woodsy_yellow_regular['CATEGORY'] == 'ING ENHANCED WATER']
woodsy_yellow_regular_enhanced_water['MANUFACTURER'].value_counts()
MANUFACTURER
COCOS    12285
Name: count, dtype: int64
northwest_woodsy = woodsy_yellow_regular_enhanced_water.groupby('DATE')['UNIT_SALES'].sum().reset_index()
northwest_woodsy.tail(1)
| | DATE | UNIT_SALES |
|---|---|---|
| 147 | 2023-10-28 | 9717.0 |
from prophet import Prophet
from prophet.plot import plot_plotly, plot_components_plotly
# Initialize and fit the Prophet model
p_model_3 = Prophet()
northwest_woodsy = northwest_woodsy.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
p_model_3.fit(northwest_woodsy)
# Create future dates for forecasting
future_3 = p_model_3.make_future_dataframe(periods=52, freq='W')
# Make predictions
forecast_3 = p_model_3.predict(future_3)
# Plot the forecast
plot_plotly(p_model_3, forecast_3)
# Week-wise sales for the best 13 weeks
northwest_woodsy_forecast_1year = forecast_3[forecast_3['ds'] > '2023-10-28'].copy()
earliest_date_4 = northwest_woodsy_forecast_1year['ds'].min()
# Calculate week numbers
northwest_woodsy_forecast_1year['WEEK_NUMBER'] = ((northwest_woodsy_forecast_1year['ds'] - earliest_date_4).dt.days // 7) % 52
weekly_forecast_data_northwest_woodsy = northwest_woodsy_forecast_1year.groupby('WEEK_NUMBER')['yhat'].sum().reset_index()
# Ensure that there are enough rows to compute the rolling sum
if len(weekly_forecast_data_northwest_woodsy) < 13:
print("Insufficient data to compute rolling sum.")
else:
# Initialize an empty list to store the results
rolling_sales_sum = []
# Iterate over the DataFrame to calculate the rolling sum for each 13-week interval
for i in range(len(weekly_forecast_data_northwest_woodsy) - 12):
# Calculate the sum of UNIT_SALES for the current 13-week interval
sales_sum = weekly_forecast_data_northwest_woodsy.loc[i:i+12, 'yhat'].sum()
# Construct the 13-week interval string
interval = f"{weekly_forecast_data_northwest_woodsy.iloc[i]['WEEK_NUMBER']}-{weekly_forecast_data_northwest_woodsy.iloc[i+12]['WEEK_NUMBER']}"
# Append the result to the list
rolling_sales_sum.append((sales_sum, interval))
# Create a new DataFrame from the results
rolling_sales_df = pd.DataFrame(rolling_sales_sum, columns=['SALES_13W_SUM', '13_WEEK_INTERVAL'])
import matplotlib.pyplot as plt
# Assuming rolling_sales_df contains the DataFrame with 'SALES_13W_SUM' and '13_WEEK_INTERVAL' columns
# Extract data for plotting
sales_sum = rolling_sales_df['SALES_13W_SUM']
interval = rolling_sales_df['13_WEEK_INTERVAL']
# Plot the rolling sum
plt.figure(figsize=(18, 6))
plt.plot(interval, sales_sum, marker='o', linestyle='-')
plt.xlabel('13-Week Interval')
plt.ylabel('Sales Sum')
plt.title('Rolling Sum of Sales over 13-Week Intervals')
plt.xticks(rotation=45) # Rotate x-axis labels for better readability
plt.grid(True)
# Find the index of the maximum value
max_index = sales_sum.idxmax()
max_value = sales_sum[max_index]
max_interval = interval[max_index]
# Annotate the peak with a red marker and the corresponding 13-week interval
plt.annotate(f'Max: {max_value} ({max_interval})',
xy=(max_interval, max_value),
xytext=(70, -20),
textcoords='offset points',
arrowprops=dict(facecolor='red', arrowstyle='->'),
fontsize=10,
color='red',
ha='center')
plt.tight_layout()
plt.show()
total_fil1_northwest_woodsy = weekly_forecast_data_northwest_woodsy['yhat'].sum()
total_13_week_sum_northwest_woodsy = rolling_sales_df[rolling_sales_df['13_WEEK_INTERVAL']== '30.0-42.0'].sum()['SALES_13W_SUM']
percentage = total_13_week_sum_northwest_woodsy*100/total_fil1_northwest_woodsy
print(round(percentage,2))
30.98
The best 13 weeks account for 30.98% of the one-year sales. Let's do the same for the Central region.
central = merged_data[merged_data['REGION'] == 'Central']
woodsy_yellow_central = central[(central['ITEM'].str.contains('WOODSY YELLOW ', case=False, regex=True))]
woodsy_yellow_central_regular = woodsy_yellow_central[woodsy_yellow_central['CALORIC_SEGMENT'] == 'REGULAR']
woodsy_yellow_central_regular_enhanced_water = woodsy_yellow_central_regular[woodsy_yellow_central_regular['CATEGORY'] == 'ING ENHANCED WATER']
woodsy_yellow_central_regular_enhanced_water['MANUFACTURER'].value_counts()
MANUFACTURER
COCOS    11438
Name: count, dtype: int64
central_woodsy = woodsy_yellow_central_regular_enhanced_water.groupby('DATE')['UNIT_SALES'].sum().reset_index()
central_woodsy.tail(1)
|  | DATE | UNIT_SALES |
|---|---|---|
| 147 | 2023-10-28 | 7127.0 |
from prophet import Prophet
from prophet.plot import plot_plotly, plot_components_plotly
# Initialize and fit the Prophet model
p_model_4 = Prophet()
central_woodsy = central_woodsy.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
p_model_4.fit(central_woodsy)
# Create future dates for forecasting
future_5 = p_model_4.make_future_dataframe(periods=52, freq='W')
# Make predictions
forecast_5 = p_model_4.predict(future_5)
# Plot the forecast
plot_plotly(p_model_4, forecast_5)
# week wise sales for best 13 weeks
central_woodsy_forecast_1year = forecast_5[forecast_5['ds'] > '2023-10-28'].copy()  # .copy() so the WEEK_NUMBER assignment below does not hit a view
earliest_date_6 = central_woodsy_forecast_1year['ds'].min()
# Step 2: Calculate week numbers
central_woodsy_forecast_1year['WEEK_NUMBER'] = ((central_woodsy_forecast_1year['ds'] - earliest_date_6).dt.days // 7) % 52
weekly_forecast_data_central_woodsy = central_woodsy_forecast_1year.groupby('WEEK_NUMBER')['yhat'].sum().reset_index()
# Ensure that there are enough rows to compute the rolling sum
if len(weekly_forecast_data_central_woodsy) < 13:
    print("Insufficient data to compute rolling sum.")
else:
    # Initialize an empty list to store the results
    rolling_sales_sum = []
    # Iterate over the DataFrame to calculate the rolling sum for each 13-week interval
    for i in range(len(weekly_forecast_data_central_woodsy) - 12):
        # Calculate the sum of forecasted sales (yhat) for the current 13-week interval
        sales_sum = weekly_forecast_data_central_woodsy.loc[i:i+12, 'yhat'].sum()
        # Construct the 13-week interval string
        interval = f"{weekly_forecast_data_central_woodsy.iloc[i]['WEEK_NUMBER']}-{weekly_forecast_data_central_woodsy.iloc[i+12]['WEEK_NUMBER']}"
        # Append the result to the list
        rolling_sales_sum.append((sales_sum, interval))
    # Create a new DataFrame from the results
    rolling_sales_df = pd.DataFrame(rolling_sales_sum, columns=['SALES_13W_SUM', '13_WEEK_INTERVAL'])
import matplotlib.pyplot as plt
# Assuming rolling_sales_df contains the DataFrame with 'SALES_13W_SUM' and '13_WEEK_INTERVAL' columns
# Extract data for plotting
sales_sum = rolling_sales_df['SALES_13W_SUM']
interval = rolling_sales_df['13_WEEK_INTERVAL']
# Plot the rolling sum
plt.figure(figsize=(18, 6))
plt.plot(interval, sales_sum, marker='o', linestyle='-')
plt.xlabel('13-Week Interval')
plt.ylabel('Sales Sum')
plt.title('Rolling Sum of Sales over 13-Week Intervals')
plt.xticks(rotation=45) # Rotate x-axis labels for better readability
plt.grid(True)
# Find the index of the maximum value
max_index = sales_sum.idxmax()
max_value = sales_sum[max_index]
max_interval = interval[max_index]
# Annotate the peak with a red marker and the corresponding 13-week interval
plt.annotate(f'Max: {max_value} ({max_interval})',
xy=(max_interval, max_value),
xytext=(70, -20),
textcoords='offset points',
arrowprops=dict(facecolor='red', arrowstyle='->'),
fontsize=10,
color='red',
ha='center')
plt.tight_layout()
plt.show()
total_fil1_central_woodsy = weekly_forecast_data_central_woodsy['yhat'].sum()
total_13_week_sum_central_woodsy = rolling_sales_df[rolling_sales_df['13_WEEK_INTERVAL'] == '32.0-44.0'].sum()['SALES_13W_SUM']
percentage = total_13_week_sum_central_woodsy*100/total_fil1_central_woodsy
print(round(percentage,2))
31.14
The best 13 weeks account for 31.14% of the one-year sales. The table below summarizes the forecasted best 13-week windows for both regions and both filter strategies:
import pandas as pd
data = {
    'Region': ['NorthWest', 'Central', 'NorthWest', 'Central'],
    'Filter': ['Brand + Category', 'Brand + Category', 'Flavor + Caloric + Category', 'Flavor + Caloric + Category'],
    'Best 13 Weeks (Forecasted)': ['29 - 41', '25 - 37', '30 - 42', '32 - 44'],
    'Absolute Value (Unit Sales)': [300731, 272066, 173096, 108107],
    'Absolute Value Percentage for total weeks': ['30.62%', '31.24%', '30.98%', '31.14%']
}
df = pd.DataFrame(data)
df
|  | Region | Filter | Best 13 Weeks (Forecasted) | Absolute Value (Unit Sales) | Absolute Value Percentage for total weeks |
|---|---|---|---|---|---|
| 0 | NorthWest | Brand + Category | 29 - 41 | 300731 | 30.62% |
| 1 | Central | Brand + Category | 25 - 37 | 272066 | 31.24% |
| 2 | NorthWest | Flavor + Caloric + Category | 30 - 42 | 173096 | 30.98% |
| 3 | Central | Flavor + Caloric + Category | 32 - 44 | 108107 | 31.14% |
Item Description: Peppy Gentle Drink Pink Woodsy .5L Multi Jug
Swire plans to release this product in the Southern region for 13 weeks. What is the forecasted weekly demand for this product over that window?
southern_keys = [
259, 260, 524, 272, 531, 277, 534, 535, 793, 794, 799, 801, 294, 807, 44,
557, 310, 825, 316, 831, 323, 585, 849, 862, 866, 867, 358, 365, 622, 367,
882, 891, 893, 895, 900, 913, 409, 673, 674, 939, 179, 949, 951, 952, 953,
197, 967, 205, 210, 979, 212, 980, 981, 216, 729, 220, 733, 478, 739, 743,
746, 748, 238, 754, 759, 766
]
# Filter the DataFrame
southern = market_demand[market_demand['MARKET_KEY'].isin(southern_keys)]
peppy = southern[(southern['BRAND'].str.contains('peppy', case=False, regex=True))]
peppy_regular = peppy[(peppy['CALORIC_SEGMENT'].str.contains('Regular', case=False, regex=True))]
peppy_regular_ssd = peppy_regular[(peppy_regular['CATEGORY'].str.contains('SSD', case=False, regex=True))]  # filter on the SSD category (the original cell repeated the CALORIC_SEGMENT filter here)
peppy_regular_ssd_multi_jug = peppy_regular_ssd[(peppy_regular_ssd['PACKAGE'].str.contains('Multi Jug', case=False, regex=True))]
df = peppy_regular_ssd_multi_jug[['DATE','UNIT_SALES']].groupby(by="DATE", as_index=False).sum()
df['DATE'] = pd.to_datetime(df['DATE'])
df.set_index('DATE', inplace=True)
df = df.asfreq('W-SAT')
plt.plot(df)
There are NaNs in some weeks. Digging deeper, we see that no sales were recorded in August 2023.
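One way to confirm which weeks `asfreq` inserted as NaN, and one possible fill strategy, sketched on toy data (the dates and values here are illustrative, not the Peppy series):

```python
import pandas as pd

# Toy weekly series with one missing Saturday, simulating the August gap
dates = pd.to_datetime(['2023-07-29', '2023-08-05', '2023-08-19'])
toy = pd.DataFrame({'UNIT_SALES': [100.0, 110.0, 120.0]}, index=dates)
toy = toy.asfreq('W-SAT')  # inserts 2023-08-12 as a NaN row

missing_weeks = toy.index[toy['UNIT_SALES'].isna()]
# If the gap truly means "no recorded sales", filling with 0 is one option;
# interpolation would be another if the gap is a data-collection artifact.
filled = toy.fillna(0.0)
```

Which choice is right depends on whether the August gap reflects zero demand or missing records, so it is worth deciding explicitly before fitting.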
# Train and Test periods
from datetime import datetime
s_date = '2023-06-03'
e_date = '2023-08-26'
df = df.loc[df.index <= e_date]
train_period = df.loc[df.index < s_date]
test_period = df.loc[(df.index >= s_date) & (df.index <= e_date)]
start_forecast = datetime.strptime(s_date, '%Y-%m-%d').date()
end_forecast = datetime.strptime(e_date, '%Y-%m-%d').date()
We will create and reuse this function to measure each model's performance on the data. It reports MAE, MSE, RMSE, MAPE, and a direct accuracy percentage (100 minus MAPE).
from sklearn.metrics import mean_absolute_error, mean_squared_error
from math import sqrt
import numpy as np
def get_scores(actual, predicted):
    # Calculate errors
    mae = mean_absolute_error(actual, predicted)
    mse = mean_squared_error(actual, predicted)
    rmse = sqrt(mse)
    percentage_diff = np.abs((actual - predicted) / actual) * 100
    # Calculate MAPE (note: undefined if any actual value is zero)
    mape = percentage_diff.mean()
    # Calculate "Accuracy" Percentage
    accuracy_percentage = 100 - mape
    # Print metrics
    print(f"MAE: {mae}")
    print(f"MSE: {mse}")
    print(f"RMSE: {rmse}")
    print(f"MAPE: {mape}%")
    print(f"Direct 'Accuracy' Percentage: {accuracy_percentage}%")
    return pd.Series(data={'mae': mae, 'mse': mse, 'rmse': rmse, 'mape': mape, 'direct_accuracy': accuracy_percentage},
                     index=['mae', 'mse', 'rmse', 'mape', 'direct_accuracy'])
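As a sanity check on the metric definitions used in `get_scores`, a toy computation with hand-verifiable numbers (independent of the function above):

```python
import numpy as np

actual = np.array([100.0, 200.0, 300.0])
predicted = np.array([110.0, 190.0, 300.0])

mae = np.abs(actual - predicted).mean()                      # (10 + 10 + 0) / 3
mape = (np.abs((actual - predicted) / actual) * 100).mean()  # (10% + 5% + 0%) / 3 = 5%
direct_accuracy = 100 - mape                                 # 95%
```

Note that MAPE is undefined when any actual value is zero, which is worth keeping in mind for low-volume weeks.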
ARIMA is a statistical model used for time series analysis to forecast future data points by leveraging past data. It combines three main aspects: autoregression (AR), differencing (I) to make the time series stationary, and moving average (MA). The AR part exploits the relationship between an observation and a number of lagged observations, the I part involves differencing the data to achieve stationarity, and the MA part models the error of the observation as a combination of previous error terms.
from statsmodels.tsa.arima.model import ARIMA
# ARIMA Model fitting
order = (2, 1, 2)  # (p, d, q)
model = ARIMA(train_period['UNIT_SALES'], order=order)
model_fit = model.fit()
# Forecasting
forecast = model_fit.predict(start=start_forecast,
                             end=end_forecast)
arima_forecast = forecast
arima_score = get_scores(test_period.squeeze(), arima_forecast)
MAE: 10610.921900092897
MSE: 133607735.47205538
RMSE: 11558.881237907732
MAPE: 3.682049809516563%
Direct 'Accuracy' Percentage: 96.31795019048344%
# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(test_period['UNIT_SALES'], label='Actual Sales')
plt.plot(arima_forecast, label='Forecast', color='red')
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
plt.legend()
plt.show()
SARIMA extends ARIMA by explicitly accommodating and modeling seasonal effects in time series data. It includes additional seasonal elements on top of the AR, I, and MA components. SARIMA is characterized by its ability to model both non-seasonal and seasonal components of the time series data, making it more versatile than ARIMA for data with clear seasonal patterns, such as sales data around specific holidays or events. It incorporates additional parameters to handle seasonality, which are seasonal AR, seasonal differencing, and seasonal MA components, allowing it to capture seasonal fluctuations effectively, making it ideal for products with seasonal demand.
from statsmodels.tsa.statespace.sarimax import SARIMAX
# SARIMA Model fitting
order = (1, 1, 1) # (p, d, q)
seasonal_order = (1, 1, 1, 52) # (P, D, Q, s)
model = SARIMAX(train_period['UNIT_SALES'], order=order, seasonal_order=seasonal_order)
model_fit = model.fit()
# Forecasting
forecast = model_fit.predict(start=start_forecast.strftime('%Y-%m-%d'),
                             end=end_forecast.strftime('%Y-%m-%d'))
sarima_forecast = forecast  # keep a named copy; it is scored and plotted below
sarima_score = get_scores(test_period.squeeze(), sarima_forecast)
MAE: 24348.214490805636
MSE: 801042551.4448053
RMSE: 28302.695126874496
MAPE: 8.420642444814801%
Direct 'Accuracy' Percentage: 91.5793575551852%
# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(test_period['UNIT_SALES'], label='Actual Sales')
plt.plot(sarima_forecast, label='Forecast', color='red')
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
plt.legend()
plt.show()
Prophet is a forecasting tool designed by Facebook for handling time series data that displays patterns on different time scales such as yearly, weekly, and daily. It is especially useful for data with strong seasonal effects and several seasons of historical data. Prophet works by fitting nonlinear trends with yearly, weekly, and daily seasonality, plus holiday effects. It is robust to missing data and shifts in the trend, and typically requires no manual tuning of parameters. The model accommodates seasonality through Fourier series and includes components for holidays and special events, making it well-suited for predicting demand for products around specific events or holidays, like Easter.
from prophet import Prophet
df_mod = train_period.reset_index()
df_mod['DATE'] = pd.to_datetime(df_mod['DATE'])
df_mod.columns = ['ds', 'y']
model = Prophet(changepoint_prior_scale=0.1,seasonality_prior_scale=0.5, weekly_seasonality=True)
model.fit(df_mod)
# Forecast 53 weekly periods ahead, then slice out the validation window below
future = model.make_future_dataframe(periods=53, freq='W-SAT')
forecast = model.predict(future)
prophet_forecast = forecast[['ds','yhat']].copy()  # .copy() avoids SettingWithCopyWarning on the edits below
prophet_forecast['ds'] = pd.to_datetime(prophet_forecast['ds'])
prophet_forecast.columns = ['DATE', 'yhat']
prophet_forecast.set_index('DATE', inplace=True)
prophet_forecast = prophet_forecast.asfreq('W-SAT')
prophet_forecast = prophet_forecast.loc[(prophet_forecast.index >= s_date) & (prophet_forecast.index <= e_date)]
prophet_score = get_scores(test_period.squeeze(), prophet_forecast.squeeze())  # already sliced to the test window above
MAE: 14082.66495499823
MSE: 257837638.2131654
RMSE: 16057.32350714668
MAPE: 4.88229507297955%
Direct 'Accuracy' Percentage: 95.11770492702045%
# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(test_period['UNIT_SALES'], label='Actual Sales')
plt.plot(prophet_forecast, label='Forecast', color='red')
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
plt.legend()
plt.show()
Exponential Smoothing is a time series forecasting method for univariate data that applies smoothing factors to the observations, giving more weight to recent observations while not discarding older observations entirely. It encompasses simple exponential smoothing for data with no clear trend or seasonality, and extends to Holt’s linear trend model and Holt-Winters’ seasonal model, which can account for both trends and seasonality in the data. This method is straightforward and computationally efficient, making it a good choice for producing quick forecasts in situations where data patterns are reasonably consistent over time, but may struggle with data that has complex patterns or significant irregularities.
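The core recursion behind simple exponential smoothing, which the Holt-Winters model used below extends with additive trend and seasonal terms, can be sketched in a few lines (toy values, not the sales data):

```python
def simple_exp_smooth(y, alpha):
    """Return the final smoothed level: s_t = alpha*y_t + (1-alpha)*s_{t-1}."""
    s = y[0]  # seed with the first observation
    for value in y[1:]:
        s = alpha * value + (1 - alpha) * s
    return s

# With alpha = 0.5 the level updates 10 -> 11 -> 11 -> 12
level = simple_exp_smooth([10.0, 12.0, 11.0, 13.0], alpha=0.5)
```

A larger alpha weights recent observations more heavily, which is the knob the `smoothing_level` argument turns in the statsmodels fit below.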
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# ExponentialSmoothing Model fitting
# Note: smoothing_trend replaces the deprecated smoothing_slope argument
model = ExponentialSmoothing(train_period['UNIT_SALES'], trend='add', seasonal='add', seasonal_periods=52).fit(smoothing_level=0.6, smoothing_trend=0.2, smoothing_seasonal=0.2)
# Forecasting
forecast_periods = ((end_forecast - start_forecast).days // 7) + 1
exponential_forecast = model.forecast(forecast_periods)
forecast_dates = pd.date_range(start=start_forecast, periods=forecast_periods, freq='W-SAT')
exponential_smoothing_score = get_scores(test_period.squeeze(), exponential_forecast)
MAE: 23907.5251431621
MSE: 744489010.3472316
RMSE: 27285.325916089616
MAPE: 8.281600968319658%
Direct 'Accuracy' Percentage: 91.71839903168035%
# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(test_period['UNIT_SALES'], label='Actual Sales')
plt.plot(exponential_forecast, label='Forecast', color='red')
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
plt.legend()
plt.show()
pd.options.display.float_format = '{:.2f}'.format
q2_scores = pd.DataFrame({'Arima':arima_score, 'Sarima':sarima_score, 'Prophet':prophet_score, 'Exponential Smoothing':exponential_smoothing_score}).T
print(q2_scores)
                            mae          mse     rmse  mape  direct_accuracy
Arima                  10610.92 133607735.47 11558.88  3.68            96.32
Sarima                 24348.21 801042551.44 28302.70  8.42            91.58
Prophet                14082.66 257837638.21 16057.32  4.88            95.12
Exponential Smoothing  23907.53 744489010.35 27285.33  8.28            91.72
import seaborn as sns
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(18, 10))
fig.suptitle('Model Performance Metrics')
for i, column in enumerate(q2_scores.columns):
    row, col = divmod(i, 3)
    sns.barplot(ax=axes[row, col], x=q2_scores.index, y=q2_scores[column])
    axes[row, col].set_title(column)
    axes[row, col].set_ylabel('')  # Removing the y-label for cleanliness
# Adjust layout
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
Based on the updated metrics, the Prophet model appears to be the most suitable choice for this scenario, especially when considering the graph resemblance to actual values as a crucial criterion. Although its MAE, MSE, and RMSE values are not the lowest among the compared models, they indicate a reasonable accuracy level in predictions close to the actual values. The Prophet model's MAPE is relatively low, suggesting that its predictions' percentage errors are acceptable, contributing to its overall forecast reliability. Furthermore, its Direct Accuracy is impressively high, only slightly below the highest, showcasing its effectiveness in closely matching the actual trends. Given the importance of graph resemblance to actuals for this analysis, the Prophet model, with its balance between statistical accuracy and visual trend matching, is deemed the most appropriate choice.
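On the error metrics alone, ARIMA would win; the case for Prophet rests on visual trend fit, so it helps to make the metric ranking explicit rather than leaving the trade-off implicit. A sketch over a table shaped like `q2_scores`, with values copied from the printed output:

```python
import pandas as pd

# Metric values transcribed from the printed q2_scores table
scores = pd.DataFrame({
    'rmse': [11558.88, 28302.70, 16057.32, 27285.33],
    'mape': [3.68, 8.42, 4.88, 8.28],
}, index=['Arima', 'Sarima', 'Prophet', 'Exponential Smoothing'])

best_by_rmse = scores['rmse'].idxmin()  # model with lowest RMSE
best_by_mape = scores['mape'].idxmin()  # model with lowest MAPE
```

Both criteria point at ARIMA, which frames the Prophet choice as a deliberate preference for trend shape over raw error.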
df = peppy_regular_ssd_multi_jug[['DATE','UNIT_SALES']].groupby(by="DATE", as_index=False).sum()
df['DATE'] = pd.to_datetime(df['DATE'])
df.set_index('DATE', inplace=True)
df = df.loc[df.index <= "2023-08-26"]
df = df.asfreq('W-SAT')
from prophet import Prophet
df_mod = df.reset_index()
df_mod['DATE'] = pd.to_datetime(df_mod['DATE'])
df_mod.columns = ['ds', 'y']
model = Prophet(changepoint_prior_scale=0.1,seasonality_prior_scale=0.5, weekly_seasonality=True)
model.fit(df_mod)
future = model.make_future_dataframe(periods=13, freq='W-SAT')
forecast = model.predict(future)
prophet_forecast = forecast[['ds','yhat','yhat_upper','yhat_lower']].copy()  # .copy() avoids SettingWithCopyWarning on the edits below
prophet_forecast['ds'] = pd.to_datetime(prophet_forecast['ds'])
prophet_forecast.columns = ['DATE', 'yhat','yhat_upper','yhat_lower']
prophet_forecast.set_index('DATE', inplace=True)
prophet_forecast = prophet_forecast.asfreq('W-SAT')
prophet_forecast = prophet_forecast.loc[(prophet_forecast.index >= "2023-09-02")]
20:33:04 - cmdstanpy - INFO - Chain [1] start processing
20:33:04 - cmdstanpy - INFO - Chain [1] done processing
prophet_forecast
| DATE | yhat | yhat_upper | yhat_lower |
|---|---|---|---|
| 2023-09-02 | 285476.40 | 298778.29 | 271838.69 |
| 2023-09-09 | 288354.50 | 300856.25 | 276032.32 |
| 2023-09-16 | 289057.31 | 302472.57 | 274802.73 |
| 2023-09-23 | 287248.53 | 300958.92 | 273797.52 |
| 2023-09-30 | 285106.30 | 298506.32 | 271294.70 |
| 2023-10-07 | 284271.78 | 298363.42 | 271306.87 |
| 2023-10-14 | 283922.05 | 297361.37 | 270622.10 |
| 2023-10-21 | 282352.50 | 296781.29 | 268788.65 |
| 2023-10-28 | 279677.03 | 294189.70 | 266454.82 |
| 2023-11-04 | 277865.93 | 290756.87 | 264269.66 |
| 2023-11-11 | 278087.04 | 290981.57 | 264992.78 |
| 2023-11-18 | 279119.81 | 293013.24 | 265333.38 |
| 2023-11-25 | 279277.11 | 293360.46 | 265840.58 |
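For production planning, the weekly forecasts can be rolled up into a 13-week total and an average weekly rate. The values below are transcribed from the yhat column of the table above:

```python
# Weekly yhat values for 2023-09-02 through 2023-11-25, from the table above
yhat = [285476.40, 288354.50, 289057.31, 287248.53, 285106.30,
        284271.78, 283922.05, 282352.50, 279677.03, 277865.93,
        278087.04, 279119.81, 279277.11]

total_13_week_demand = sum(yhat)                      # total units for the window
avg_weekly_demand = total_13_week_demand / len(yhat)  # average weekly rate
```

The same roll-up applied to `yhat_lower` and `yhat_upper` would bracket the production target with a conservative and an optimistic total.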
plt.figure(figsize=(10, 6))
plt.plot(prophet_forecast.index, prophet_forecast['yhat'], label='Predicted', color='blue', marker='o')
plt.fill_between(prophet_forecast.index, prophet_forecast['yhat_lower'], prophet_forecast['yhat_upper'], color='gray', alpha=0.2, label='Confidence Interval')
plt.title('Forecast with Confidence Interval')
plt.xlabel('Date')
plt.ylabel('Forecast Value')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
The forecasting analysis for Swire's upcoming beverage, the Peppy Gentle Drink in "Pink Woodsy" flavor, packaged in a .5L Multi Jug, focuses on the product's demand projection for a 13-week span around the Southern region's Easter 2024 period. The data was carefully curated from a comprehensive market demand dataset to match the product's attributes: caloric segment, type, manufacturer, and packaging. This targeted filtering allowed for an accurate representation and analysis relevant to the product's market introduction.
Several forecasting models were employed, including ARIMA, SARIMA, Prophet, and Exponential Smoothing, each evaluated against critical performance metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), and Direct 'Accuracy' Percentage. These metrics provided a comprehensive assessment of each model's accuracy and reliability in predicting future sales demand.
The Prophet model emerged as the most fitting choice for this specific forecasting task. While its MAE, MSE, and RMSE values did not register as the absolute lowest among the models evaluated, they were nonetheless indicative of a commendably high level of prediction accuracy. The Prophet model's relatively low MAPE and high Direct Accuracy Percentage further affirmed its efficacy in closely mirroring actual sales trends, a crucial factor for this analysis.
The demand forecast through the Prophet model reveals an anticipated fluctuation in sales over the target period, with projections highlighting key trends and potential peak demand times around Easter 2024. This predictive insight is invaluable for Swire in strategizing production, inventory, and marketing efforts to optimally meet anticipated consumer demand. By effectively leveraging the Prophet model's forecast, Swire can ensure product availability aligns with market demand, maximizing sales potential and customer satisfaction during this critical launch period.
Michael: Q2, Q6 and Q7 Modelling with interpretations, Model evaluation metrics, ARIMA & SARIMA
Rawali: Q3 Analysis with Interpretations, Model Validation & evaluation
Neil: Q4 Analysis with Interpretations, Q5 Modelling, Notebook final edits
Abinav: Q1 Modelling with Interpretations, Q5 Visualizations, Prophet model, Writing part for Introduction, Data Preparation, Approach & assumptions.